How to Ensure Your MongoDB Clusters Can Survive AWS Outages

If you’re hosting your MongoDB cluster in the AWS US-East region, the last month has been fairly interesting: two outages in four weeks have tested the operational readiness of your cloud deployments. As I type this blog post, the Sao Paulo region is also experiencing connectivity issues. A surprising number of production databases did not survive the AWS outages. We had the opportunity to talk to a number of MongoDB on AWS customers to understand how the outages affected their deployments. From that quick survey, here are the four main reasons teams experienced downtime:

  1. Running a standalone instance vs. a replica set

    If you’re running a production MongoDB server, there’s really no excuse for running a standalone instance instead of a replica set. Create a replica set so that you have a secondary to fail over to if the primary fails.

  2. Not distributing replicas across availability zones

    Distribute your replicas across different availability zones (AZs) in a region. This way, if a single AZ goes down, as happened twice this month, your remaining servers will take over and you will still have a functioning cluster. If your region has only two AZs, place your arbiter in a different region, so that losing either AZ still leaves a majority of voting members to elect a primary. However, this will not help you if the entire region goes down. To survive the failure of an entire AWS region, you need to distribute your replica set across different regions.

  3. Not distributing your front-ends or app servers across availability zones

    Make sure you distribute your front-ends across different availability zones. There is no point in having your database up and running if your front-end is down. If cost is a concern, you can keep an up-to-date front-end ‘stopped’ in each AZ and start it when needed. Another option is to run smaller front-end instances.

  4. Connecting to a single server instead of the replica set in your connection string

    Make sure you connect to the replica set instead of a single server. The syntax differs between drivers, so check your driver documentation to ensure you’re using the right syntax for a replica set connection. This way, if there is a failover, the MongoDB driver will do the right thing and connect to the new primary.
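
As a rough illustration of point 4, here is a minimal Python sketch that builds a replica set connection string in the standard MongoDB URI format. The helper name `replica_set_uri`, the host names, and the replica set name `rs0` are all hypothetical; substitute the values from your own deployment and check your driver documentation for its exact options.

```python
from urllib.parse import quote_plus


def replica_set_uri(hosts, replica_set, database=""):
    """Build a MongoDB URI that lists several replica set members.

    hosts: list of "host:port" strings (hypothetical example values below).
    replica_set: the replica set name configured on the servers.
    database: optional default database name.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    # Listing several members lets the driver discover the set even if
    # one of the listed servers is down when the application starts.
    seed_list = ",".join(hosts)
    return f"mongodb://{seed_list}/{database}?replicaSet={quote_plus(replica_set)}"


# Hypothetical hosts, one per availability zone.
uri = replica_set_uri(
    ["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
    "rs0",
)
print(uri)
```

A string like this can then be handed to your driver, for example PyMongo's `MongoClient(uri)`. By contrast, a URI that names only one server and omits `replicaSet` gives the driver no hint that other members exist, which is the situation point 4 warns about.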

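Points 1 and 2 in the list above come down to one question: if any single availability zone disappears, does a majority of voting members remain to elect a primary? The sketch below checks a planned layout against that rule. The function name `survives_single_az_outage` and the zone labels are hypothetical, and the check assumes every listed member (arbiters included) has one vote.

```python
from collections import Counter


def survives_single_az_outage(member_zones):
    """Return True if a voting majority remains after losing any one zone.

    member_zones: one failure-domain label per voting member, e.g. the
    availability zone (or region) each replica or arbiter runs in.
    """
    if not member_zones:
        return False
    total = len(member_zones)
    majority = total // 2 + 1
    # The worst case is losing the zone that hosts the most voting members.
    worst_loss = max(Counter(member_zones).values())
    return total - worst_loss >= majority


# A standalone instance: nothing is left when its zone fails.
print(survives_single_az_outage(["us-east-1a"]))
# Three members, all in the same zone.
print(survives_single_az_outage(["us-east-1a", "us-east-1a", "us-east-1a"]))
# Three members spread over three zones.
print(survives_single_az_outage(["us-east-1a", "us-east-1b", "us-east-1c"]))
# Two data nodes in a two-AZ region plus an arbiter in another region.
print(survives_single_az_outage(["sa-east-1a", "sa-east-1b", "us-east-1a"]))
```

Only the last two layouts pass the check. To reason about the loss of a whole region, the same function can be used with region names as the labels.
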
At ScaleGrid, we automate all the operational aspects of your deployment so you can focus on your app and not worry about operations. When you create a MongoDB replica set with ScaleGrid, we automatically distribute the replicas across availability zones. Thanks to this distribution, all of our customers were able to safely navigate the AWS outages. If you’re interested in a more detailed read on the operational aspects of MongoDB, see my earlier blog post, “10 questions to ask and answer when hosting MongoDB on AWS”.