Database Queries: How To Find a Needle In a Haystack?

Needle In A Haystack Loupe DrawingThe poster child scenario for big data – you need to sift through a large amount of data to extract a tiny “nugget” of information. Also, it needs to be done in the shortest amount of time as possible, your business depends on it. Historically, using traditional relational database management system (RDBMS) technology, this sort of scenario required a large team and a investment of time and money. Most traditional RDBMS’s only scale vertically, so you have to keep buying larger and larger machines to reduce your turnaround time. The advent of public clouds and NoSQL databases like MongoDB has completely disrupted how teams are thinking about this scenario.

One of our customers recently came to us with an interesting problem. They periodically needed to run a really complex query that scanned their entire data set. This query was pretty much a collection scan that touched every document in the collection, and included these details:

  • Total data was about 100GB.
  • Data safety was not a issue since the master copy of the data resided elsewhere.
  • Query speed was extremely important. The goal was to be able to run the entire query within 10-15 minutes.
  • The system needed to be up only when the query is running (minimize cost).

Due to the last requirement, it made sense to run the entire system on a public cloud. The machines get turned on for only a few hours every week for the data to get updated and the query to run. The customer was already comfortable with Amazon EC2, so the decision was made to prototype the system in AWS.

The best configuration to achieve this goal was a “sharded” MongoDB deployment. Here’s the configuration we settled on:

  • 3 shards – each shard has a standalone instance (r3.xlarge) with 30 GB of RAM
  • 1 config server
  • 1 shard router (m3.xlarge) with 15 GB of RAM

A couple of things to point out about our choices:

  • Standalone vs. Replica Set

    Data safety is not an important requirement here since the master data is stored in a separate system. Hence we went with standalone servers instead of a replica set to save on cost.

  • 3 Config Servers vs 1 Config Server

    ame reason as above. Data safety is not an important issue. In a typical production environment we would have gone with three config servers.

The real beauty of this configuration is that due to the sharded configuration, almost the entire 100GB of data is stored completely in-memory. So, essentially what you are running is an “in-memory” scan. This dramatically reduced the run time of the query from a few hours to less than 10 minutes. The use of the public cloud also dramatically reduced the capital investment since you only pay for the machines when they are running.

This is a fairly dramatic change to how teams have been handling this scenario over the past decade. So, if you’re in the “finding a needle in a haystack” business, think Cloud + NoSQL!

As always, if you have any questions you can reach us at [email protected].