How to write a MapReduce framework in Python

An easy-to-follow guide

“I tried to explain as much as I could,” Poppet says. “I think I made an analogy about cake.” “Well, that must have worked,” Widget says. “Who doesn’t like a good cake analogy?”
— Erin Morgenstern, The Night Circus

I do not know why I decided to name this framework mapcakes, but I love that name and everybody loves cake. Anyway…

MapReduce is an elegant model that simplifies processing large datasets. As the result of a weekend project, here is an overly simplistic MapReduce framework implemented in Python. This post walks through the steps I followed and an example implementation that counts the words in “Unveiling A Parallel: A Romance”, taken from Project Gutenberg. The finished version of the code is available as mapcakes on GitHub. Here are a few of the choices that were made for the implementation:

  • We are going to use CPython version 2.7.6.
  • The multiprocessing module is used to spawn processes, by calling the start() method on a created Process object.
  • There is an output file corresponding to each reduce process.
  • The outputs can be merged into a single file at the end.
  • The results of the map step (as well as the output files for each reduce process) are stored on disk as JavaScript Object Notation (JSON) files.
  • One may choose to delete or keep these files at the end.
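
The choices above can be sketched in a few functions. This is a minimal illustration, not the mapcakes code itself: the names `map_task`, `reduce_task`, `run_stage`, `merge_outputs` and `word_count`, the `map_<i>_<j>.json` and `reduce_<j>.json` file names, and the CRC32-based partitioning are all assumptions made for the example. It is also written for Python 3 rather than the 2.7.6 used in this post, and it merges the reduce outputs into a dictionary instead of a single file.

```python
import json
import os
import zlib
from multiprocessing import Process


def map_task(map_id, text, n_reducers, workdir):
    """Count the words of one chunk and write one JSON file per reducer."""
    partitions = [{} for _ in range(n_reducers)]
    for word in text.lower().split():
        # Assumption: partition with crc32, which is stable across processes
        # (the built-in hash() of a str can differ between interpreters).
        reducer_id = zlib.crc32(word.encode("utf-8")) % n_reducers
        counts = partitions[reducer_id]
        counts[word] = counts.get(word, 0) + 1
    for reducer_id, counts in enumerate(partitions):
        # Hypothetical naming scheme: map_<map id>_<reducer id>.json
        path = os.path.join(workdir, "map_%d_%d.json" % (map_id, reducer_id))
        with open(path, "w") as handle:
            json.dump(counts, handle)


def reduce_task(reducer_id, n_mappers, workdir):
    """Sum the counts from every map file addressed to this reducer."""
    totals = {}
    for map_id in range(n_mappers):
        path = os.path.join(workdir, "map_%d_%d.json" % (map_id, reducer_id))
        with open(path) as handle:
            for word, count in json.load(handle).items():
                totals[word] = totals.get(word, 0) + count
    # One output file per reduce process.
    path = os.path.join(workdir, "reduce_%d.json" % reducer_id)
    with open(path, "w") as handle:
        json.dump(totals, handle)


def run_stage(target, arg_lists):
    """Start one Process per argument tuple, then wait for all of them."""
    workers = [Process(target=target, args=args) for args in arg_lists]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


def merge_outputs(n_reducers, workdir):
    """Merge the reduce files; each word lives in exactly one of them."""
    merged = {}
    for reducer_id in range(n_reducers):
        path = os.path.join(workdir, "reduce_%d.json" % reducer_id)
        with open(path) as handle:
            merged.update(json.load(handle))
    return merged


def word_count(chunks, n_reducers, workdir):
    """Run the map stage, then the reduce stage, then merge the results."""
    n_mappers = len(chunks)
    run_stage(map_task, [(i, chunk, n_reducers, workdir)
                         for i, chunk in enumerate(chunks)])
    run_stage(reduce_task, [(r, n_mappers, workdir)
                            for r in range(n_reducers)])
    return merge_outputs(n_reducers, workdir)
```

To try it, call `word_count(chunks, 2, workdir)` with a list of text chunks and an existing directory, from under an `if __name__ == "__main__":` guard. The guard matters on platforms where `multiprocessing` starts child processes by re-importing the main module. The intermediate JSON files stay in `workdir` afterwards, so deleting or keeping them is left to the caller.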
