In my previous post, I began illustrating what Hadoop would look like as a business organization, focusing on the DataNodes in the Hadoop Distributed File System (HDFS). Assuming all the rooms or DataNodes are in good condition, how do we start the process of gathering and analyzing data? The process begins with the managers (Mappers). Each Mapper (manager) sits at one of the two desks in each room. It is their job to make sure that all the data that needs to be analyzed together ends up in the same room. This is because the Reducers (analysts) are only going to sum up each store's sales ONCE before they report results; if they don't have the complete information, they will come up short in their sums. For example, since they want to calculate the total sales for each of the 10,000 stores, they need to make sure that all sales for Store 1, Store 2, etc. are in the same room. By necessity, each room houses the data for many stores, but by the time the Mappers are done, the Reducers know that if they have any data for Store 1, they have ALL the data for Store 1, so they can reliably tabulate the sales for that store. The managers also know which of the data they are looking at is a redundant copy (kept for fault tolerance), so duplicate data is not analyzed twice. Mappers mark each bit of data with a key that identifies which store the data belongs to, which makes the shuffle of data between the DataNode rooms straightforward: every record carries a label saying where it belongs.
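To make the manager's job concrete, here is a minimal sketch of a Hadoop Mapper that tags each sales record with its store key. The input layout is an assumption for illustration (each line looks like "storeId,saleAmount"), and the class name StoreSalesMapper is made up for this example.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The "manager": reads raw sales records and emits (storeId, saleAmount) pairs,
// so the shuffle can route every sale for the same store to the same Reducer.
public class StoreSalesMapper
    extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {

  private final IntWritable storeId = new IntWritable();
  private final DoubleWritable saleAmount = new DoubleWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed input format: "storeId,saleAmount" per line.
    String[] fields = line.toString().split(",");
    storeId.set(Integer.parseInt(fields[0].trim()));
    saleAmount.set(Double.parseDouble(fields[1].trim()));
    context.write(storeId, saleAmount);
  }
}
```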
Another thing to note is that the Reducers can't start their analysis until all of the data has been mapped and sorted. What happens if one of the Mappers is not feeling well and is working at a fraction of the usual pace? None of the Reducers can begin their analysis, because the slow Mapper may hold some of the data they need. In this case, MH (really, Hadoop) will hand a duplicate copy of the lagging Mapper's remaining work to faster Mappers that have already completed their own tasks, and whichever copy finishes first is used, so a single slow Mapper will not hold up the entire process.
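In Hadoop this straggler handling is called speculative execution: a backup copy of the slow task runs on another node and whichever copy finishes first is kept. As a minimal sketch, assuming the Hadoop 2.x+ property names, it can be toggled through the job configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionConfig {
  // Build a configuration with speculative execution explicitly enabled
  // (these properties default to true in stock Hadoop distributions).
  public static Configuration withSpeculation() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // allow backup copies of slow map tasks
    conf.setBoolean("mapreduce.reduce.speculative", true);  // allow backup copies of slow reduce tasks
    return conf;
  }
}
```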
So now that all the data has been successfully shipped, the Reducers can begin their job in earnest. They sit at the second desk in each room and first sort all the data handed to them by the Mappers by key (e.g. store number). Then, for each store number, they perform their analysis. For example, if a Reducer were assigned Stores 1, 5, and 10, it would grab all the data with key = 1, calculate the total sales, and then repeat the same process for Stores 5 and 10. After each store's calculation is complete, the result, something like Store=1, total sales=10000, is written back to the hard drive in the room (but in a different directory). What is left at the end of the process is a dataset, spread across the ten DataNode rooms, that lists each Store and the total daily sales at that Store. The data can either be collected for a final report, or the entire MapReduce process can begin again. What if MH wanted the total sales for each geographical region? They would take the output from the previous job, have the Mappers map each store number to its region number, and send it off to the Reducers to repeat the cycle.
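Continuing the sketch from above (same assumed class and field names), the "analyst" side might look like this: for each store key, the Reducer sums every sale that the shuffle delivered and writes the (store, total) pair back out.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// The "analyst": receives all sales for a given store together and tabulates the total.
public class StoreSalesReducer
    extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

  private final DoubleWritable totalSales = new DoubleWritable();

  @Override
  protected void reduce(IntWritable storeId, Iterable<DoubleWritable> sales,
                        Context context) throws IOException, InterruptedException {
    // All sales for this store arrive together, so one pass gives the complete total.
    double sum = 0.0;
    for (DoubleWritable sale : sales) {
      sum += sale.get();
    }
    totalSales.set(sum);
    context.write(storeId, totalSales);  // e.g. (1, 10000.0) for Store 1
  }
}
```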
This example has been a very simple analogy; there are many more complexities to Hadoop. Mappers can have assistant analysts (Combiners) that do light processing before sending the data on to the actual Reducers. They can also have assistant organizers (Partitioners) who decide which Reducers get which stores instead of relying on Hadoop's default assignment. Analysts can also have their data sorted any way they would like (custom comparators), and so on. In addition, not every room needs to have a Mapper or Reducer present; some will have one, both, or neither, depending on the details of the analysis and what is most efficient. There are numerous knobs to turn, which is why the Hadoop framework is so powerful, flexible, and complicated. In the end, this framework is extremely valuable because it allows massive amounts of data to be processed in parallel, with hardware resources that can grow or shrink to match. Got 1 billion stores? Hire 1,000 Mappers and 10 Reducers. Got one store? Fire everyone except for one Mapper and one Reducer. It is this level of flexibility and scalability that has propelled Hadoop to the forefront of Big Data analytics, and it is why we leverage this type of infrastructure at Guidewire to churn through massive amounts of insurance claims data with minimal time and commodity hardware.
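To show where those knobs live, here is a hedged sketch of the job "driver" that wires the earlier Mapper and Reducer together. The class names carry over from the illustrative sketches above; the commented-out lines show where a custom Partitioner or comparator would plug in (RegionPartitioner and StoreKeyComparator are hypothetical and not provided here).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that assembles the store-sales job and exposes the optional "knobs".
public class StoreSalesJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "store sales totals");
    job.setJarByClass(StoreSalesJob.class);

    job.setMapperClass(StoreSalesMapper.class);
    job.setReducerClass(StoreSalesReducer.class);

    // The Reducer doubles as the Combiner ("assistant analyst") because summing
    // is associative and its input and output types match.
    job.setCombinerClass(StoreSalesReducer.class);
    // job.setPartitionerClass(RegionPartitioner.class);      // "assistant organizer": custom store routing
    // job.setSortComparatorClass(StoreKeyComparator.class);  // custom sort order for keys

    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(DoubleWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```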