This Christmas I decided to give myself a gift: ‘Hadoop: The Definitive Guide’ by Tom White. As most people who read this post will already know, Apache Hadoop was inspired by Google’s MapReduce and supports data-intensive applications that work with petabytes of data (1 petabyte is 1 million gigabytes) (1). Given the nature of this book I did not put it under the Christmas tree. My wife even questioned why I would give myself such a ‘gift’ instead of something more ‘fun’. For me, getting to know a new paradigm in computer technology was actually a lot of fun. It was something I had been wanting to do for quite a while, but never got around to.
I decided to dive directly into Hadoop itself instead of opting for a more top-level approach through Hive. Hive was developed by Facebook to provide a more SQL-like interface to Hadoop (2), and SQL is of course well known within the Analytics or Business Intelligence (BI) community. Since I was trying to understand a new paradigm, I wanted to understand the fundamentals of Hadoop. In my opinion, following something that was already familiar to me and was designed to work with the current BI world would not allow me to grasp the full potential of the new world of Hadoop, as many of the new possibilities would be lost in translation. Unfortunately this required me to dive deep into the world of Java programming, which is the most common interface to Hadoop. I briefly got into Java in 1996 to develop interactive websites but eventually dropped it, because at the time none of the commonly accepted web browsers (for example Netscape 2) were able to run the Java applets. From talking to Java developers on a regular basis I had already understood that Java had come a long way since 1996, but this was the first time that I had to understand Java code myself.
Going through the examples given in the book and other material from the internet, including the Apache project website itself, I realised that Hadoop followed the classic pattern of a paradigm shift as explained by the modern philosopher Thomas Kuhn. As Kuhn explains in his famous work ‘The Structure of Scientific Revolutions’, a new paradigm arises when the current paradigm no longer provides a suitable explanation for all the issues presented to a (scientific) community. Contrary to popular belief, Kuhn describes that a new paradigm does not come from people totally unaware of the current paradigm, or from those who deliberately try to avoid it. Instead it arises when people well aware of the current theories and concepts have to find solutions for problems they can no longer solve using these theories and concepts.
Hadoop follows the same pattern. Many of the Hadoop examples will give a sense of recognition to BI professionals, especially those who are involved in the Extract, Transform and Load (ETL) process, which involves getting all the required source data and loading it into a more convenient format. The Google MapReduce framework is based on two steps, Map and Reduce, and the Mapper function in particular deals with many issues that ETL developers have to deal with as well. Does dealing with empty values and data quality issues ring a bell? Putting several map and reduce steps in a flow resembles even more closely an ETL flow that can be implemented with currently available commercial and open source BI tooling. In this regard Hadoop is well grounded in the existing paradigm.
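To make the parallel with ETL a bit more concrete, here is a minimal sketch in plain Java of a map step that cleans its input and a reduce step that aggregates it. It deliberately does not use the Hadoop API: the class name `MapReduceSketch`, the `year,reading` input format and the in-memory grouping are all assumptions made for illustration, and a real Hadoop job would let the framework do the grouping across many machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Illustrative only: a single-machine imitation of the Map and Reduce steps.
final class MapReduceSketch {

    // Map step: parse one raw "year,reading" line (a hypothetical format).
    // Empty or malformed values are skipped, much like an ETL cleansing step.
    static Optional<Map.Entry<String, Integer>> map(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != 2) {
            return Optional.empty();
        }
        String year = fields[0].trim();
        String reading = fields[1].trim();
        if (year.isEmpty() || reading.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Map.entry(year, Integer.parseInt(reading)));
        } catch (NumberFormatException e) {
            return Optional.empty(); // data quality issue: not a number
        }
    }

    // Reduce step: collapse all values for one key into a single result,
    // here the maximum reading.
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Stand-in for the framework: group mapped pairs by key, then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            map(line).ifPresent(p ->
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                       .add(p.getValue()));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "1949,111", "1949,78", "1950,", "1950,22", "bad line", "1950,n/a");
        System.out.println(run(lines)); // prints {1949=111, 1950=22}
    }
}
```

The three bad records (an empty reading, a line without a separator and a non-numeric reading) never reach the reduce step, which is the same kind of filtering an ETL developer would build into a transformation.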
However, Hadoop was clearly designed to solve a different set of needs than those currently served by existing BI applications, or by transaction-oriented database technology for that matter. It is in these needs that the Hadoop framework shines. The argument that most people use is that Hadoop can handle more data, as reflected by the slogan printed on the book I bought for Christmas: “Storage and Analysis at Internet Scale”. But for me, this does little justice to Hadoop’s capabilities, simply because existing technology can be stretched further to handle more data. That is an argument many of the existing BI and database vendors and professionals like to give, which is in line with another concept from Thomas Kuhn: that the new paradigm can never be understood from the old paradigm.
For me, the main areas in which Hadoop shines are its abilities to run continuously, to always be on, to serve many users and to quickly do real-time analysis.
- Run continuously
Hadoop was clearly designed to run continuously. Of course it had to be. Google’s users live in every time zone and do not want to wait for a batch window to complete before they can use the service again. Also, the internet changes every minute, and working with week-old or even day-old information just would not cut it.
- To always be on
It is also amazing how much effort was put into making it fault tolerant so that it could always be on. I guess the consumers using Google were much more demanding than the average business BI users, who have to wait when a nightly load was unsuccessful and the process has to start again.
- Serve many users
Hadoop is also designed to serve many, many users at the same time. The existing BI paradigm, by contrast, mostly sees a lot of users providing data (through operational systems) and only a few analyzing this data or using the information that is obtained from it. Providing information to the select few at the top of the organization is often seen as the holy grail in the current BI paradigm. Hadoop takes almost the opposite approach: it can obviously analyze a lot of data from various sources, but it can also serve millions of users at the same time.
- Quickly do real-time analysis
Instead of calling it ‘big-data’ we should perhaps call it ‘fast-data’. Hadoop is not only able to handle a lot of data, but also to analyze it quickly. This not only gives users real-time information, it also eliminates the need to use pre-aggregated data to deliver the information fast enough. As a result Hadoop can be used in situations where a lot of newly created or ‘fresh’ data has to be analyzed in a very short timeframe. This is uncommon for many other frameworks, which rely on aggregation and/or indexing of historic data to deliver information faster.
Following the theory presented, the new paradigm will eventually replace the old one. The benefits of Hadoop described above may seem unnecessary for many organizations today, but they could be absolutely vital tomorrow. Increasingly, organizations operate 24/7 worldwide, requiring their information to flow continuously. Business users have become more demanding, requiring that their business apps are always on, as they are accustomed to as consumers. More and more employees, at all levels of the organization, are demanding access to the information they need for their jobs, and the use of information is increasingly embedded in daily routines. Strangely enough, though, existing BI applications struggle to increase their user base. Perhaps a new kind of technology is what is required. Finally, the increasing use of data that comes from outside the organization, for example social media or sensor information, needs a framework that can analyze vast amounts of new data quickly enough for decisions to be made in a short timeframe.
A couple of years from now we might conclude that the new paradigm that Hadoop brings us now has provided us with the answers to all our issues. Until, of course, new problems arise that Hadoop can no longer solve. At that time it will be time to move on to the next paradigm, which, as ‘The Structure of Scientific Revolutions’ tells us, will continue to happen.
2) Personal interpretation