Monday, June 28, 2010

Maintaining the App Engine Datastore with MapReduce

One of the items on the todo list for my application My Web Brain is to add features which will require changes to the data model used by the application. Making changes to an application's data model in the App Engine datastore is sometimes more painful than it should be, since the developer remains responsible for ensuring that existing entities are made consistent with the new model.

Having recently read about appengine-mapreduce, I decided to write a simple Mapper that looks for and fixes missing property values and invalid ReferenceProperty keys. You can find the ModelHygieneMapper (as I call it) at he3-appengine-lib in the new file (some simple documentation coming soon).

The code for the ModelHygieneMapper that does the work is very simple, and mostly worth listing in case someone can suggest an improved approach.

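from google.appengine.ext import db
from mapreduce import operation as op

def process(entity):
  '''Checks and repairs model integrity of the passed entity
  1. Removes dangling references
  2. Sets undefined datastore values to the default
  '''
  # Collect the Property descriptors declared on the entity's model class.
  props = [x for x in entity.__class__.__dict__.values()
           if isinstance(x, db.Property)]
  changed = False
  for prop in props:
    value = prop.get_value_for_datastore(entity)
    if not value:
      # Note: any falsy value (None, 0, '', False) is treated as missing.
      prop.__set__(entity, prop.default_value())
      changed = True
    elif isinstance(prop, db.ReferenceProperty):
      # db.get returns None when the referenced entity no longer exists.
      if not db.get(value):
        prop.__set__(entity, None)
        changed = True
  # Only write the entity back if something was actually repaired.
  if changed:
    yield op.db.Put(entity)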
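
To make the repair logic easier to try outside App Engine, here is a minimal sketch of the same two checks using plain dictionaries in place of the datastore. Everything in it (STORE, NOTE_SCHEMA, repair) is a hypothetical stand-in and not part of the App Engine SDK or he3-appengine-lib. It also tests for None rather than any falsy value, which is one possible variation on the mapper, not what the listing does.

```python
# Stand-in for the datastore: key -> entity dict. All names are hypothetical.
STORE = {
    "user:1": {"name": "Ann"},
    "note:1": {"title": None, "owner": "user:1"},   # missing title
    "note:2": {"title": "Hi", "owner": "user:99"},  # dangling reference
}

# Hypothetical schema: property name -> (default value, is_reference)
NOTE_SCHEMA = {"title": ("untitled", False), "owner": (None, True)}


def repair(entity, schema, store):
    """Return True if the entity was changed and needs to be saved."""
    changed = False
    for name, (default, is_reference) in schema.items():
        value = entity.get(name)
        if value is None:
            # Only write when a real default exists, to avoid needless saves.
            if default is not None:
                entity[name] = default
                changed = True
        elif is_reference and value not in store:
            # The referenced key is gone, so clear the reference.
            entity[name] = None
            changed = True
    return changed


results = {key: repair(STORE[key], NOTE_SCHEMA, STORE)
           for key in ("note:1", "note:2")}
```

In the real mapper, the "needs to be saved" result corresponds to yielding op.db.Put(entity).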

The concept of MapReduce, and the appengine-mapreduce project itself, are more interesting than my simple mapper. MapReduce is a framework for breaking up large problems into small parts that can be run in parallel (as I read on Wikipedia). Appengine-mapreduce is a project that has begun implementing this concept on Google App Engine for Python and Java.

The work of using MapReduce is in defining the mappers and reducers. Mappers perform work on the problem data in parallel, while reducers (from what I understand) recombine the outputs based on information in the original dataset. If you can structure a solution to your problem using this paradigm, it becomes easier to distribute the work between computers and datacenters. For App Engine developers, MapReduce is a way to traverse the datastore or another collection of data (for example, lines in a file) effectively without re-engineering the plumbing that allows you to do this.
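
As a rough illustration of the mapper and reducer roles, here is a minimal single-process word count in plain Python. It is only a sketch of the general idea and does not use appengine-mapreduce; the function names (map_line, reduce_counts, run_mapreduce) are made up for this example.

```python
from collections import defaultdict


def map_line(line):
    """Mapper: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(lines):
    # Shuffle step: group mapper output by key. A real framework does this
    # across machines; here it is a single in-memory dict.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


counts = run_mapreduce(["the quick fox", "the lazy dog", "The end"])
```

Because each call to map_line only looks at its own line, the map step could be spread over many workers; only the grouping and reducing need to see results from more than one input.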

Using appengine-mapreduce in your application, as I did with ModelHygieneMapper, is easy. Include the source code for the library in your application and add a URL mapping to your app.yaml file. Create a mapreduce.yaml file which contains the configuration for the mapper or mappers you wish to use. If you haven't already, write or include the mapper. The getting started guide for Python explains this pretty well.
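
For reference, the two pieces of configuration look roughly like the following, based on my reading of the getting started guide. The job name, the handler path (model_hygiene.process) and the entity kind (models.MyModel) are placeholders I made up for illustration; substitute the module and model names from your own application.

```yaml
# app.yaml: route the library's URLs to its handler (admin only)
handlers:
- url: /mapreduce(/.*)?
  script: mapreduce/main.py
  login: admin
```

```yaml
# mapreduce.yaml: declare the mapper job
mapreduce:
- name: Model hygiene              # placeholder job name
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: model_hygiene.process # placeholder module.function
    params:
    - name: entity_kind
      default: models.MyModel      # placeholder model class
```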

I will be keeping an eye on appengine-mapreduce for other tasks it may be appropriate for. In the meantime, I am happy that I had a way to implement ModelHygieneMapper without worrying about the plumbing. If you have any feedback about my mapper, please let me know.


  1. Nice! The list comprehension at the start isn't necessary, though: The class method .properties() returns a dict mapping property names to property classes.
