What Hadoop Looks Like as a Business Organization, Part 1

  • Alan Lee

September 18, 2015

Hadoop has been around for roughly a decade: it was first conceived by Doug Cutting and Mike Cafarella in 2005 and later developed further at Yahoo. It has become ubiquitous because it is an extremely useful technology for processing batch jobs over massive amounts of data on commodity, non-specialized hardware. This is especially relevant in the P&C insurance industry, where data can easily reach Big Data levels and the need arises to scale storage and analytics on cost-efficient infrastructure. The other day, I was wondering what the Hadoop paradigm would look like as an office organization and realized that it is a great way to illustrate how the technology fundamentally works. The analogy is an oversimplification, since the actual details of the Hadoop framework are very complicated, but it still lends itself well to the comparison.

Imagine a company that runs 10,000 stores across the world. This company has an organization called Manual Hadoop (MH), whose sole job is to calculate each store’s total product sales for the previous day and store the results for reporting. MH has three basic components: hard drives for storing the data, managers for assigning work, and analysts who crunch the numbers. The hard drives store all the data the company collects each day; this is their version of HDFS, the Hadoop Distributed File System, Hadoop’s native storage layer. The managers (Hadoop Mappers) look at the data, format it, and hand it to the right analyst for the actual analysis. The analysts (Hadoop Reducers) accept data from all the managers assigning work, analyze it, and then save the results back into HDFS. This simple example captures the basic flow of a Hadoop job. Depending on the task at hand, MH can scale the number of Mappers and Reducers up or down to finish large or small tasks in a timely manner. We can take this analogy much further to illustrate some details of the process.
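To make the picture concrete, here is a minimal sketch of what MH’s daily job might look like as an actual Hadoop MapReduce program written in Java. The class name, the assumed input format (one `storeId,amount` CSV line per sale), and the paths are hypothetical, not part of any real MH system; the Mapper plays the manager, routing each sale amount to the analyst for that store, and the Reducer plays the analyst, totaling the amounts and writing the results back into HDFS.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyStoreSales {

  // The "manager": reads one raw sales record per line ("storeId,amount")
  // and routes the amount to the analyst responsible for that store.
  public static class SalesMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length >= 2) {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
      }
    }
  }

  // The "analyst": receives every sale amount for one store and totals it.
  public static class SalesReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text storeId, Iterable<DoubleWritable> amounts,
                          Context context)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(storeId, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "daily store sales");
    job.setJarByClass(DailyStoreSales.class);
    job.setMapperClass(SalesMapper.class);
    job.setReducerClass(SalesReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw sales files in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // per-store totals, written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would be launched with something along the lines of `hadoop jar dailysales.jar DailyStoreSales /sales/2015-09-17 /reports/2015-09-17`, where both paths live in HDFS.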

Let’s start with HDFS. The company’s worldwide sales for the previous day are delivered at the beginning of each working day and stored across multiple hard drives (e.g. 10). Each hard drive sits in its own room with two desks where its data is easily accessible (more about the desks next time). We call each of these rooms a DataNode, and its primary responsibility is to house the data for consumption. But the data doesn’t just exist in one place. Each piece of data in one DataNode also resides in another DataNode somewhere else; by default, HDFS keeps three copies of every block of data. The data is always copied across multiple rooms in case the hard drive in one room becomes corrupted. Therefore, if a single room gets flooded from the bathroom on the floor above or catches fire, no data is lost and daily operations can continue. To make this kind of failsafe redundancy possible, each room has a connection to the other rooms so the hard drives can ship data to one another when necessary. However, if enough rooms catch fire at once, some data may be lost along with every one of its copies. This is the heart of HDFS: a redundant, distributed data source for the Hadoop engine.
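To see what that redundancy looks like from a client program’s point of view, here is a small sketch using Hadoop’s `FileSystem` API; the file path is hypothetical. The `setReplication` call asks HDFS to keep three copies of the file’s blocks, and `getFileBlockLocations` lists the DataNodes (the “rooms”) that currently hold each copy.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster described by the client's core-site.xml / hdfs-site.xml.
    FileSystem fs = FileSystem.get(new Configuration());

    Path sales = new Path("/sales/2015-09-17/store-sales.csv");  // hypothetical file

    // Ask HDFS to keep three copies of this file's blocks across DataNodes.
    fs.setReplication(sales, (short) 3);

    // Each BlockLocation lists the DataNodes ("rooms") holding a copy of one block.
    FileStatus status = fs.getFileStatus(sales);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " lives on: " + String.join(", ", block.getHosts()));
    }
  }
}
```

In my next post, I will discuss the process of gathering and analyzing the data.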