An easy-to-follow guide
I do not know why I decided to name this framework mapcakes, but I love the name, and everybody loves cake. Anyway…
MapReduce is an elegant model that simplifies the processing of large datasets. As the result of a weekend project, here is an overly simplistic MapReduce framework implemented in Python. This post walks through the steps I followed and an example implementation that counts the words in “Unveiling A Parallel: A Romance”, taken from Project Gutenberg. The finished code is available as mapcakes on GitHub.
Here are some of the choices made for the implementation:
- We are going to use CPython version 2.7.6.
- The multiprocessing module is used to spawn processes by calling the start() method on a Process object.
- Each reduce process writes its own output file.
- The outputs can be merged into one single file in the end.
- The results of the map step, as well as the output of each reduce process, are stored on disk as JavaScript Object Notation (JSON) files.
- These intermediate files can be deleted or kept at the end.
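To make these choices concrete, here is a minimal sketch of the reduce side only. It is not the mapcakes code: the function names, the `reduce_file_%d.json` naming scheme and the `merged.json` file are all hypothetical, picked just for illustration. It also assumes each worker receives a disjoint set of words, so merging is a plain dictionary update.

```python
import json
import os
from multiprocessing import Process


def reduce_file_name(output_dir, index):
    # Hypothetical naming scheme: one JSON file per reduce worker.
    return os.path.join(output_dir, "reduce_file_%d.json" % index)


def reduce_worker(index, pairs, output_dir):
    """Sum the counts per word and write them to this worker's own file."""
    counts = {}
    for word, count in pairs:
        counts[word] = counts.get(word, 0) + count
    with open(reduce_file_name(output_dir, index), "w") as handle:
        json.dump(counts, handle)


def run_reducers(partitions, output_dir):
    """Spawn one process per partition with start(), then wait for all."""
    workers = [
        Process(target=reduce_worker, args=(index, pairs, output_dir))
        for index, pairs in enumerate(partitions)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


def merge_outputs(output_dir, n_reducers, clean=False):
    """Merge the per-worker files into merged.json (hypothetical name).

    Assumes the workers handled disjoint words, so update() loses nothing.
    With clean=True the per-worker files are deleted after being read.
    """
    merged = {}
    for index in range(n_reducers):
        path = reduce_file_name(output_dir, index)
        with open(path) as handle:
            merged.update(json.load(handle))
        if clean:
            os.remove(path)
    with open(os.path.join(output_dir, "merged.json"), "w") as handle:
        json.dump(merged, handle)
    return merged
```

Calling `run_reducers(partitions, output_dir)` and then `merge_outputs(output_dir, len(partitions), clean=True)` covers the list above: one process and one file per reducer, an optional merge, and optional cleanup. On platforms where multiprocessing starts fresh interpreters instead of forking (Windows, for example), the call to `run_reducers` has to sit under an `if __name__ == "__main__":` guard.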