Tuesday, April 27, 2010

PagedQuery - Easy paging using cursors on AppEngine

In the 1.3.1 release of Google App Engine, the App Engine team added support for Query cursors. I am happy to release a small and simple utility, PagedQuery, for providing a paging abstraction that uses the cursors and caches them to the memcache.

A little background if you are not up to speed on the history of data paging on App Engine: In the beginning, the App Engine datastore did not provide much support for paging operations on the data it returned. The fetch() method of a google.appengine.ext.db Query or GqlQuery object did provide an offset parameter that could be used to bypass previous results. As the App Engine team were quick to point out, however, using the offset parameter still retrieved the offset results from the datastore, they just were not returned. So my_query.fetch(limit=10, offset=10) actually retrieved 20 results, even if only 10 were returned.

This was perhaps acceptable for small datasets, but it was not an efficient use of App Engine quota, performed sub-optimally and truly did not scale against the other limitations the platform imposed.

There existed a way to fashion a query using a known sort order and a filter based on unique values to create a bookmark. The App Engine team originally wrote an article about this and left it to the community to provide a solution. Thankfully Rodrigo Moraes stepped up to the plate and wrote the easy to use Pager.py, which I have been using successfully for a while now in My Web Brain. The helper object was not without its limitations, but given the complexity of the algorithms involved I was not able to contribute much to their resolution.

Finally, the App Engine team rode to the rescue with the implementation of several datastore enhancements in App Engine SDK release 1.3.1 (also see the release notes). In this release they implemented query cursors, an official version of the bookmarks used by the community. Apart from the usual problematic property types, cursors enjoyed support for all of the remaining property types (which was an issue for Pager) and for both Query and GqlQuery. Query cursors work in a straightforward way: perform a fetch on a query, and then call the query.cursor() method to retrieve a base 64 encoded cursor string, which could be passed back to a query with the with_cursor() method to start off where the last query ended.

Despite the advantage of being officially supported, Cursors still have some limitations when compared with the venerable Pager.py. Pager.py provided a previous cursor, for example, to skip to the previous page's set of results, and the overall Pager.py interface presented the paging metaphor, which is a lot more natural then managing cursors.

Of course, the App Engine Team have done the heavy lifting with providing a supported cursor in the first place, which makes it easy for someone like me to create a simple Query wrapper which exposes page controls. Which is exactly what I have done in the form of PagedQuery.

PagedQuery exposes a simple paging interface with methods decorating the traditional query object such as get_page(), has_page() and page_count(). The cursors themselves are hidden behind the scenes using memcache for persistence between requests - this means as a PagedQuery user you only need to track the current page number.

The PagedQuery uses a best-effort approach to using cursors for efficient data retrieval, while still allowing a full range of paging functionality. For example:
  • PagedQuery will use the Query method count() to understand how many results are present, but then cache the result so it does not need to be called again. 
  • If you start with the first page of results and proceed sequentially forward (maybe with some detours backward) across multiple requests, each data retrieval will use a previous cached cursor. However, if you skip forward the PagedQuery will use offset in order to find the result.
The intent is to provide an easy paging interface which may not be completely free of costly datastore calls, but that is appropriate for many situations and which allows a full set of paging functionality.

If you think this might interest you, check out the source - I tried to keep it well commented. If you have any questions or would like to log an issue, go ahead and visit the he3-appengine-lib project page on Google Code.

I will be adding my unit tests shortly and I have a minor change to make already. Some refactoring, and possibly some more Query wrapper classes, are also on the way. 

12 comments:

  1. Hi,

    Would you like to add support for paging, for SearchableModel? Its very convient to use what you wrote, and I really appreciate it. Adding support for SearchableModel would make it complete.

    ReplyDelete
  2. Thanks for the suggestion, Harshal. I'll add it to the issues list. I haven't used Searchable model so I do not know much about it, but that hasn't stopped me doing anything before :)

    ReplyDelete
  3. There is a small bug when using has_page function
    has_page(0) should return false but it returns true

    ReplyDelete
  4. Thanks for the bug report Nick. I've created an issue on the Google Code hosting site and I'll fix the issue soon (as well as extend my unit testing to catch this in the future). Thanks again for the report!

    ReplyDelete
  5. the has_page(0) == true bug is now fixed with r26.

    ReplyDelete
  6. Cool solution Ben. Congratulations! Just one question, how do you mantain the PagedQuery object between the different requests? Saving it in session? Saving it in memcache? I have implemented the second approach (I can share the code if someone needs it) but I wanted to ask ho do you use it.

    ReplyDelete
  7. Hi Agualis, thanks for your comments. In a nutshell I store a list of query cursors in memcache, using a key I derive from pickling and then hashing the query. Not rocket science. I think a lot of people were doing something similar. I just packaged it up :)

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. Hi Ben, Thanks so much for this. I have a few questions. Alfred Fuller at the Google IO Next Gen Queries talk spoke about enhancing cursors to support forward and backward paging. Do you know what's going on with that. And how would that affect the paging support that you created.

    The reason I ask is I'm adding paging support to some of my app engine applications and I'd like to do it in the best way.

    Isn't it true that page back may not work using your approach, depending on the memcache and the whether the user paged through the result before they page back.

    ReplyDelete
  10. Hi fedex1 - I haven't had a chance to watch the Google IO talk or any other of the Google IO talks from this year yet. I will need to get on with it before next year's are posted (or I hopefully I see them in person :))

    You are correct in that once the memcache copy of the page cursors expires the page query will need to cheat by using an offset. In the current implementation all the page cursors are saved as a single memcache entity for that query, so they will all drop out at once if that happens.

    If it is easily possible to generate a backward cursor this would be ideal and would allow efficient backward paging regardless of memcache state.

    Thanks for your comment and the tip about the Google IO session. If anyone is interested they can find the video at http://www.youtube.com/watch?v=ofhEyDBpngM

    ReplyDelete
  11. How does this apply to the ndb API, could it be easily adapted to ndb?

    ReplyDelete
  12. Hi Ben, Thanks a million for this solution. I have found PagedQuery immensely helpful and it certainly made life much easier when i came to adding paging support to a small appengine app i am currently working on.
    Thanks u once again and congrats.

    ReplyDelete