The first beta version of Hadoop 2 has just been released: 2.1.0.
Even more interesting, the stable version is expected to follow by mid-September.
We are only a few bugs away from Hadoop 2!
That’s good news, but you might be wondering why you should care. After all, what matters most is what this new major version can bring to your next datalab or to your production cluster, isn’t it?
In this article, we will cover the differences between Hadoop 1 and Hadoop 2 that you should care about.
YARN is probably the first thing that comes to mind when talking about Hadoop 2, and it is also the most important change.
In YARN, the JobTracker and the TaskTracker have been replaced by the ResourceManager and the NodeManager, two daemons dedicated to cluster resource scheduling.
Here are some of the changes YARN implies:
Better scalability
The first reason for this architectural evolution is more efficient resource management, which allows clusters to scale beyond several thousand nodes.
Good to know, but these are extreme cases, and you most likely won’t have to cope with such giant clusters.
New distributed computing frameworks
Much more interesting is the side effect of this architectural evolution: MapReduce is now a framework like any other.
It is like moving from mass production to tailor-made: you now have the opportunity to use the distributed computing paradigm that best fits your needs, given your data and the algorithms you want to apply to it.
And it has already begun. Let’s look at a few examples:
Tez, a tailor-made MapReduce for Hive and Pig
Yes, Tez is a reimplementation of the MapReduce paradigm. However, it integrates some specificities that make it a much better fit for the needs of SQL-like analytics tools.
As a result, a query executed through Tez generates fewer jobs to run on the cluster. Fewer jobs mean fewer resources needed to perform the actual work, and thus faster execution times.
Storm-Yarn and Spark Streaming, for real-time distributed computing and data flow processing
Twitter Storm and Apache Spark are Open Source solutions for real-time distributed computing and data flow processing. Yahoo! has recently released a port of Storm to YARN: Storm-Yarn.
Therefore, if your problem is to build computing or data flow processing topologies (to compute aggregates, predictions using R or something else, …) on large data flows that feed the cluster in real time, you can now do it without building a separate, Storm-dedicated cluster and without generating MapReduce jobs.
Giraph, for distributed graph analysis
In a previous article, we discussed graph databases and how to process them. Apache Giraph is an Open Source solution for distributed graph processing. Facebook uses it to analyse its social graph.
Therefore, you can now rely on the resources of your Hadoop cluster to perform this kind of analysis.
Finer-grained control over resource allocation
In YARN, the ResourceManager and the NodeManager are dedicated to cluster resource management and scheduling. Since MapReduce is no longer tightly coupled to the resource manager, a Hadoop 2 cluster does not manage mapper and reducer slots anymore. Instead, we talk about containers, a more generic abstraction: a container can execute any kind of processing, depending on the framework you choose for that specific job.
Concretely, what is the difference?
Unlike Hadoop 1 slots, a container is sized along two dimensions:
- the minimum and maximum amount of memory it can use
- the minimum and maximum number of vcores (virtual cores) it can use
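As an illustration, these cluster-wide bounds are set in yarn-site.xml; the property names below come from the Hadoop 2 configuration, while the values are arbitrary examples:

```xml
<configuration>
  <!-- Smallest and largest memory allocation a container may request -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <!-- Smallest and largest number of virtual cores per container -->
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>
</configuration>
```

Any container request is then rounded and capped by the scheduler to fit within these bounds.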
Simpler libraries to write your own YARN applications
The API used to develop a distributed computing framework on YARN (called a YARN application) is a bit complex. New libraries have been written to simplify such development, thus fostering the integration of new frameworks on top of YARN.
Therefore, if you have a computing need that is not properly fulfilled by the existing frameworks, or if the cost of porting your existing distributed code is too high, you can write your own YARN application and take better advantage of Hadoop’s distributed filesystem and resource allocation.
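To give an idea of what the client side of a YARN application looks like, here is a minimal sketch using the YarnClient API from org.apache.hadoop.yarn.client.api. It only obtains an application id and submits a (deliberately incomplete) submission context; the ApplicationMaster launch command and the whole AM logic are omitted, and it requires a running YARN cluster plus the Hadoop client jars on the classpath:

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalYarnApp {
    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager declared in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("my-yarn-app");

        // Size the container that will host the ApplicationMaster
        ctx.setResource(Resource.newInstance(1024 /* MB */, 1 /* vcores */));
        // A real application would also call ctx.setAMContainerSpec(...)
        // to describe how to launch its ApplicationMaster

        yarnClient.submitApplication(ctx);
        yarnClient.stop();
    }
}
```

Libraries built on top of this API take care of most of this boilerplate, which is precisely the simplification mentioned above.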
Hadoop on Windows
Hadoop 2 now officially supports Windows Server and Windows Azure. Moreover, Microsoft is increasingly involved in Hadoop development.
Therefore, if your IT infrastructure and your internal skills are mainly Windows-based, moving to Hadoop no longer means moving to GNU/Linux. And that can cut some superfluous training costs.
HDFS also benefits from some major improvements, even though some of them are already widespread in the most common Hadoop distributions.
One of the first major improvements of HDFS 2 is the High Availability of the NameNode. You may already know it, since it has already shipped in most, if not all, Hadoop distributions.
It is now possible to create read-only snapshots of HDFS directories.
These snapshots have the following important features:
- creating a snapshot is instantaneous: its cost is claimed to be constant (O(1)), and additional memory is only needed when the snapshotted data is later modified
- quotas limit the maximum number of snapshots per directory
- snapshot metrics are available through the NameNode WebUI
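As a sketch, the snapshot workflow relies on a handful of CLI commands (the paths and snapshot names below are examples; a running HDFS 2 cluster is required):

```shell
# Mark a directory as snapshottable (administrator command)
hdfs dfsadmin -allowSnapshot /user/alice/data

# Create a named snapshot of that directory
hdfs dfs -createSnapshot /user/alice/data backup-2013-09

# Snapshots are exposed under a read-only .snapshot directory
hdfs dfs -ls /user/alice/data/.snapshot/backup-2013-09

# Show what changed between two snapshots
hdfs snapshotDiff /user/alice/data backup-2013-09 backup-2013-10
```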
The HDFS command line and web clients were not very user-friendly or convenient, especially for non-technical people.
Now, it is possible to access HDFS through an NFS gateway. HDFS can therefore be viewed and manipulated like any other network share.
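As an illustration, once the NFS gateway daemon is running, mounting HDFS from a client host looks like this (the host name and mount point are examples; the gateway only speaks NFSv3):

```shell
# Mount the HDFS namespace exported by the NFS gateway
sudo mount -t nfs -o vers=3,proto=tcp,nolock nfs-gateway-host:/ /mnt/hdfs

# HDFS can then be browsed like any other network share
ls /mnt/hdfs/user/alice
```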
Used in conjunction with Kerberos, which is already supported in Hadoop, this makes it possible to provide HDFS shares to non-technical people in your company, thus facilitating its adoption.
However, it is a new feature and, like any other new feature, it will be necessary to assess it first in order to understand how HDFS specificities (block sizes, write-once files, append-only, …) impact the use of an NFS share.
With this new major release, Hadoop has reached a new level of maturity and is now ready for our IT departments and corporations.
Moreover, the integration of NFS and snapshots, and the possibility to use new distributed computing frameworks, clearly make Hadoop the glue that facilitates the development of large-scale processing on large volumes of data.