Q. How do you eat a pie?
A. Pick it up, and eat!
Q. How do you eat a cake?
A. Cut it into pies. And we know how to eat a pie.
Ever since we started learning Computer Science, we have been taught that dividing a problem makes it easier to solve. Map-Reduce programming applies the same strategy. We are given a large data set. To process it, we find a way to divide it and process the pieces in parallel. With frameworks like Hadoop (see: Hadoop – A Gentle Introduction), we can build clusters/grids of regular server machines and use them to run heavy-duty jobs. Each piece is processed separately, and the results are then merged to solve the bigger problem.
What is MapReduce?
It is a programming paradigm for processing large datasets in distributed computing environments. It was popularized by Google in a famous paper (see the references at the end of this post). The terms Map and Reduce are inspired by Functional Programming.
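To make the Functional Programming connection concrete, here is a minimal sketch using Python's built-in `map` and `functools.reduce`. The list of numbers and the squaring step are made up purely for illustration; this shows the two functional ideas on a single machine, not how Hadoop or Google's system is implemented.

```python
from functools import reduce

# Illustrative input; any list of values would do.
numbers = [1, 2, 3, 4]

# map: apply the same function to every element independently.
squares = list(map(lambda n: n * n, numbers))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each element is mapped without looking at the others, the map step is the part that can be spread across many machines.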
Apache’s Hadoop further popularized MapReduce by providing an open-source framework that anyone can use. This, in turn, has triggered research and development of newer ways of storing and processing data.
MapReduce – Explained
One of the simplest scenarios that most tutorials on Map-Reduce talk about is the WordCount problem. We’ll simplify it further to make it more interesting.
Remember the two functions :
- Map : Divide the job and work on each piece separately
- Reduce : Collect results and combine to form the final output
Assume you’re at a beach with a group of friends. You’ve collected some pretty pebbles of different colors. There are some blue ones, some red, and some white. You want to know how many pebbles of each color you have. Will you count them one by one? Tedious job, isn’t it?
But remember, you are with a group of friends. Let’s divide the pebbles among three of them: Alice, Bob and Charlie. Each gets a random share of the pebbles and starts counting.
Now Alice, Bob and Charlie have each counted the pebbles of each color in their own share. This was MAP.
In reduce, you aggregate the counts for each color. This is a much easier job. You don’t have to count the pebbles yourself. You just asked your friends to count and report their numbers. Now you REDUCE these partial counts to get the final Pebble-Count.
Just add them. We have 16 Red, 7 Blue and 8 White pebbles. Enjoy!
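The pebble story can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-friend piles below are hypothetical (chosen so that they add up to the totals above), and the helper names `map_count` and `reduce_counts` are invented for this sketch, not part of Hadoop or any library.

```python
from collections import Counter
from functools import reduce

# Hypothetical piles: the split between friends is invented for
# illustration; only the grand totals come from the story.
piles = {
    "Alice": ["red"] * 5 + ["blue"] * 3 + ["white"] * 2,
    "Bob": ["red"] * 6 + ["blue"] * 2 + ["white"] * 3,
    "Charlie": ["red"] * 5 + ["blue"] * 2 + ["white"] * 3,
}

def map_count(pile):
    """Map step: one friend counts the colors in their own pile."""
    return Counter(pile)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts by adding per-color totals."""
    return left + right

# Each friend counts independently; on a cluster these would run in parallel.
partial_counts = [map_count(pile) for pile in piles.values()]

# Combine the partial counts into the final pebble count.
final_count = reduce(reduce_counts, partial_counts, Counter())

print(final_count["red"], final_count["blue"], final_count["white"])  # 16 7 8
```

Here the friends count one after another in a plain loop. In a real Map-Reduce job, each pile would live on a different machine and the counting would happen at the same time.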
In this post, I have tried to simplify Map-Reduce to the point where you don’t need to bother with big books and confusing algorithms. Now that you are ready, we’ll soon have a session on actual Map-Reduce programming on Hadoop.
References and Further Reading:
- Google paper on MapReduce by Ghemawat and Dean
- Kids and Candies analogy by Michael Hausenblas