Fast Paging with MongoDB

Paging through your data is one of the most common operations with MongoDB. A typical scenario is the need to display your results in chunks in your UI. If you are batch processing your data, it is also important to get your paging strategy correct so that your data processing can scale.

Let’s walk through an example to see the different ways of paging through data in MongoDB. In this example, we have a CRM database of user data that we need to page through, displaying 10 users at a time. So, in effect, our page size is 10. Here is the structure of our user document:

{
    _id,
    name,
    company,
    state
}
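
To make the walkthrough below runnable, here is a minimal sketch that seeds a few sample users in the mongo shell (the names and values are made up purely for illustration):

//Seed some sample users (hypothetical values, for illustration only)
db.users.insert([
    { name: "Ada Smith", company: "Acme", state: "CA" },
    { name: "Bob Jones", company: "Globex", state: "NY" },
    { name: "Carol Diaz", company: "Initech", state: "TX" }
]);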

Approach 1: Using skip() and limit()

MongoDB natively supports paging with the skip() and limit() commands. The skip(n) directive tells MongoDB to skip ‘n’ results, and the limit(n) directive instructs MongoDB to limit the result length to ‘n’ results. Typically you will be using the skip() and limit() directives with your cursor, but to illustrate the scenario we provide console commands that achieve the same results. For brevity, the bounds-checking code is also excluded.

//Page 1
db.users.find().limit(10)
//Page 2
db.users.find().skip(10).limit(10)
//Page 3
db.users.find().skip(20).limit(10)
........

You get the idea. In general, to retrieve page n, the code looks like this:

db.users.find().skip(pagesize*(n-1)).limit(pagesize)
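
If it helps, the same pattern can be wrapped in a small shell helper; getPage is a hypothetical name, not a MongoDB built-in:

//Sketch: fetch page n (1-indexed) using skip/limit
function getPage(n, pageSize) {
    return db.users.find().skip(pageSize * (n - 1)).limit(pageSize).toArray();
}
//e.g. getPage(3, 10) returns documents 21-30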

However, as the size of your data increases, this approach has serious performance problems. The reason is that every time the query executes, the server must walk from the beginning of the result set to the specified offset, discarding documents along the way. As your offset increases, this process gets slower and slower, and it does not make efficient use of indexes. So the ‘skip()’ and ‘limit()’ approach is typically useful when you have small data sets. If you are working with large data sets, you need to consider other approaches.
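
You can observe this cost directly with explain(). The sketch below uses an arbitrary offset of 5000; explain("executionStats") requires a MongoDB 3.0+ shell, and totalDocsExamined in the output grows with the offset:

//The server still examines offset + limit documents for a deep page
db.users.find().skip(5000).limit(10).explain("executionStats")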

Approach 2: Using find() and limit()

The reason the previous approach does not scale well is the skip() command. So the goal in this section is to implement paging without using the ‘skip()’ command. For this we are going to leverage the natural order in the stored data, like a timestamp or an id stored in the document. In this example we are going to use the ‘_id’ stored in each document. ‘_id’ is a MongoDB ObjectId, a 12-byte structure containing a timestamp, machine id, process id, counter, etc. The overall idea is as follows:
1. Retrieve the _id of the last document in the current page
2. Retrieve documents greater than this “_id” in the next page

//Page 1
var users = db.users.find().sort({'_id': 1}).limit(pageSize).toArray();
//Find the _id of the last document in this page
var last_id = users[users.length - 1]._id;

//Page 2
users = db.users.find({'_id': {'$gt': last_id}}).sort({'_id': 1}).limit(pageSize).toArray();
//Update last_id with the _id of the last document in this page
last_id = users[users.length - 1]._id;

This approach leverages the inherent order that exists in the “_id” field. Also, since the “_id” field is indexed by default, the performance of the find operation is very good. If the field you are using is not indexed, your performance will suffer, so it is important to make sure that field is indexed.
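
For example, if you were paging on a timestamp field rather than ‘_id’, you would create the index up front. A minimal sketch, assuming a created_at field that is not part of the schema above (older shells use ensureIndex instead of createIndex):

//Index the paging field (created_at is hypothetical here)
db.users.createIndex({ created_at: 1 })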

Also, if you would like your data sorted in a particular order for your paging, you can use the sort() clause with the above technique. It is important to ensure that the sort is leveraging an index for best performance. You can append .explain() to your query to determine this.

users = db.users.find({'_id': {'$gt': last_id}}).sort(...).limit(pageSize).toArray();
//Update last_id with the _id of the last document in this page
last_id = users[users.length - 1]._id;
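
As a quick sanity check, here is a sketch of running explain() on the paged query, sorting on ‘_id’ as an example; on recent server versions, a winning plan that uses an index scan (IXSCAN) rather than an in-memory SORT stage confirms the index is being leveraged:

//Confirm the sort is satisfied by an index
db.users.find({'_id': {'$gt': last_id}}).sort({'_id': 1}).limit(pageSize).explain()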

As always if you have any questions or comments please feel free to reach out to us at support@mongodirector.com



Dharshan is the founder of ScaleGrid.io (formerly MongoDirector.com). He is an experienced MongoDB developer and administrator. He can be reached for further comment at @dharshanrg


  • alexdown

    I always used SQL LIMIT…OFFSET (or FETCH NEXT .. ROWS ONLY) without giving it a second thought. Was I writing nonperformant queries, or is it just that mongo does not (yet) cache query plans/executions?

  • dharshanr

    MySQL has the same performance issues with LIMIT…OFFSET, so I don’t think it is a MongoDB-specific thing

  • van quy bui

    Approach #2 has a problem:
    - When the user goes directly to page n, you must determine the last_id of pages n-1, n-2, … 1. So the database still scans (n-1) * pageSize items.

    • Bob

      In most cases, you don’t show all pager links, just 10 or 20 at most, as in this example:
      1 2 3 … 10 | 1 … 5 6 7 … 10 | 1 … 8 9 10
      We do not scan the whole collection.

  • dharshanr

    Thanks Van. In this example we are walking through all the pages in linear order. However, if you need to skip pages, maybe you can guess the last_id based on the page size? The goal is to use an index and prevent the database from walking through each page.

    • Bishal Dangal

      How do you create such an index? If I find the last id for each block, then wouldn’t it be the same as skip()?

      • Miguel Martínez

        You can check the indexes with db.document.getIndexes(). The _id is indexed by default.

  • Hamed Farag

    Thanks a lot Van, I have a small question.

    “Approach One is useful when you have small data sets. If we are working with large data sets we need to use Approach Two.”

    So, what do you mean by “small data sets”? How many records, do you think?

  • Nisarg Tuli

    Hey Darshan.. Approach 2 sounds good for moving forward. But what about going back a page? I am using the following query, but my application would still need to sort the records back into ascending order. Can I do this on the db?

    db.Employee.find({_id : {“$lt” : ObjectId(“551bdc2044e0306bf344f95f”)}}).sort({“_id”:-1}).limit(10)

  • parag singh

    What’s wrong in my login script? Please help.

    if(isset($_REQUEST['Login']))
    {
    $username = $_REQUEST['username'];
    $password = $_REQUEST['password'];

    if($username != '' && $password != '')
    {
    $user = $collection->find( array('username' => $username, 'password' => $password));
    $num = $user->count();
    if ($num > 0)
    {

    $_SESSION['userid'] = $user['_id'];
    header('Location: index.php');
    }
    else
    {
    echo 'error';
    }
    }
    else
    {
    echo "Fields are blank.";
    }
    }

    • http://rclayton.silvrback.com/ Richard Clayton

      This is clearly irrelevant to the blog post. Please ask the question on StackOverflow.

  • http://sdrawde.net/blog/ cce32

    The other piece of a pagination solution is knowing the total number of pages. I have found that even when cursor = find(query).skip(x).limit(y) is fast enough, the cursor->count() is the limiting factor. Any suggestions?

  • Roger Waters

    Hi,

    There might be situations where using something as precise as _id may not be feasible. Let’s say for example, I have a collection where documents have the following schema:

    ——————————————————————————–
    _id (document id) | highest_score (0 to 100) | last_active (date time)
    ——————————————————————————–

    where game scores (between 0 and 100) are recorded for users. Suppose you’d like to display a list of users with the highest scores (and, for presentation friendliness, sort it by last active time). The granularity of both score and time is coarse, and there can be many documents where the score and last active time are the same. So if you find and limit and take the first 10, what guarantees that the next set you fetch won’t overlap with what you fetched already, or worse, skip documents? Note that the correlation between _id and the rest of the fields in the document in this case would be purely coincidental.

    How would you deal with situations like this?

    • Denise

      My same thoughts. Works great for _id, and any other query which guarantees uniqueness for your sort params. Otherwise you risk missing or duplicating documents without additional logic to exclude documents already matched.

      One solution would be to add an extra column to guarantee record uniqueness amongst the scores.

      { score : [0-100], unique_score : [0-x]}

      Now you always have a sort order for your matching scores. The sort order is irrelevant to the results; whoever’s ingesting the data won’t care about the order of documents by unique_score.

      When paging, if your last document is :
      { score : 90, unique_score : 3 }

      Your query for next docs would be something like :
      db.users.find({ $or : [{ score : 90, unique_score : { $gt : 3 }}, { score : { $gt : 90 }}] })
      .sort({ score : 1, unique_score : 1 })
      .limit(10);

      There’s more work when inserting your documents, but if faster page speed is the goal this is one solution.

      • Denise

        I’ve started to implement the above, and can state that the performance improvement is obvious, (no exact figures yet, but can tell it’s running faster).

        Anyway, there’s no need to add a second parameter, as _id is always unique. My generic implementation always sorts by _id last to ensure uniqueness. Thanks for sharing this way of paging!

      • RU

        I have a very similar requirement. We store changes (more like audit logs) in a mongo collection, and need to query these in batches starting from a given timestamp. There could be multiple changes that occurred at the same time, so I have the same problem of trying to keep track of where we left off and which exact record should start the next batch. Querying only on timestamp won’t help, as we’d duplicate records, and there is a possibility that we go into an infinite loop if the batch size is less than the number of documents with the same timestamp.

        Could you elaborate a bit more on your solution? I think I understand the gist but would like to get more clarity.

        • Denise

          Sure, I can elaborate. I ended up building a generic solution (any query & sort for any collection), and it’s been working well for months. In your case you have a specific query, so this implementation can be done pretty simply. There are two options.

          Option (1): if you have unaltered MongoDB _ids and your server will not move any time soon (NY -> LA, etc.), you could just use the _id; the server’s timestamp is embedded in the MongoDB _id, and all _ids are in create-date order. For each subsequent result after the first, add the additional filter {_id : { $gt : last_id_from_prior_batch }} to your query. This guarantees the next set of results.

          Option (2): if _id alone won’t suffice, then the filtering is still based on the last record retrieved, but written differently: $and : [{_id : { $gt : last_id_from_prior_batch }}, { timestamp : { $gte : last_timestamp_prior_batch}}]. Notice that the timestamp uses $gte for the comparison; the _id part will guarantee uniqueness. To generalize this: the additional filter added to the query is created from the sort parameters and the last result of the prior batch of documents retrieved.

          • RU

            “all _ids are in create date order.” From the MongoDB docs, this is not a guarantee. Documents have a natural ordering that is not guaranteed to be insertion order, unless you are using a capped collection.

            So I’m going to try something using option 2, the main point being that we need to track the last record returned and somehow tell mongo to start from there without making use of skip(). The query would be something like: start with record x and give me the next batch.

            Thanks for info/help

          • Danielle

            Let us know how option 2 worked out for you?

  • https://twitter.com/CliffordFajard0 Clifford Fajardo

    Hello Dharshan,

    I tried your method of using the _ids of the Mongo documents; however, I’m having problems, as it seems Mongo doesn’t reliably store documents in ascending order. I actually posted a StackOverflow question about my problem with examples here:

    http://stackoverflow.com/questions/34373254/mongodb-how-to-get-the-newest-documents-within-a-collection-by-using-id

  • Irakli Jishkariani

    In the second approach: when sorting by another field, you may miss some documents and/or documents may be repeated from the previous page

    • Elad Nava

      Absolutely correct — do not sort the second query by any field other than _id!

  • Bishal Dangal

    What one should do is: first find out the size per page, then divide the documents into blocks, extract the _id of the first document of each block, and save them somewhere. When the page changes, map those ids, send them to the mongodb pagination, and get the result. This is just a theory though :)

  • David Alejandro Londoño Mejía

    use this module to paginate with a cursor https://github.com/davidlondono/mongoose-paginate-cursor

  • long night

    Approach 2 is no good when you have a large pagination such as: 1, 2, 3, … 498, 499, 500, since people will navigate to the last page.

  • Jayash Samaiya

    How will I move to the previous page if I am using the second approach?

  • chovysblog

    Found some issues — I can only get it to work if my sort is `.sort({ _id: 1 })` — reverse sorts (-1) do not work because the “after” id is wrong.

    I cannot sort by anything other than id ascending.
