Here’s a list of possible MapReduce related patterns I’ve been thinking about:
- map-update: update each mapped document and emit its updated or original version;
- map-delete: delete each mapped document;
- map-reduce-map: map results of a map-reduce;
- map-map: map results of a map;
- map-recurse: apply recursion to the mapping function until a stop condition occurs;
- any combination of patterns, e.g., map-reduce-map-update.
- get the e-mail address of every customer with a negative balance: map-reduce-map. First, map-reduce to get the aggregate balance for each customer, then map again to get only customers with negative value and emit their e-mail address;
- delete all documents older than one month: map-delete. First, map to get all documents older than one month and then delete each one;
- get a list of documents and mark them as read: map-update. First, map to get the list of documents according to a given criteria, then update each document marking it as read;
- and so on…
Just a quick note to point to two valuable MapReduce related resources:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains.
Do you suggest any other interesting publications?
Some characteristics of this new database system:
- every document must contain only JSON data;
- every document can be manipulated on the filesystem: the idea is to allow someone to edit a document using vim, for instance, without breaking anything. It also lets you easily backup, move or replicate your data without creating a heavy load on the database;
- the underlying filesystem can be changed: changing the underlying filesystem offers numerous possibilities, like easy distribution (using OpenAFS, XTREEMFS or others), replication (using finefs or others), and even attaching different backends (using FUSE, for example);
- the system must be available through an HTTP REST interface: this makes it very easy to immediately integrate the system into any existing applications. Bonus points if the API is compatible with CouchDB;
- queries must be performed using a MapReduce approach: this should make it very easy to perform queries on a large dataset.
So, right now I’m half way through it. I managed to write the backend and I’m on my way to the HTTP REST interface. I’ll create a github repo after I have something functional that can be easily built and tested.
What do you think of a system like this? Any features you’d like to add or change?