Hadoop in my IT department: benchmark your cluster
Stress testing is a very important step before going live.
Good stress tests help us to:
- ensure that the software meets its performance requirements
- ensure that the service delivers a fast response time, even under heavy load
- get to know the scalability limits, which in turn is useful to plan the next steps of the development
Hadoop is not a web application, a database or a webservice. You don't stress test a Hadoop job with a heavy load. Instead, you need to benchmark the cluster, which means assessing its performance by running a variety of jobs, each focused on a specific field (indexing, querying, predictive statistics, machine learning, ...).
Intel has released HiBench, a tool dedicated to running such benchmarks. In this article, we will talk about this tool.
What is HiBench?
HiBench is a collection of shell scripts published under the Apache License 2.0 on GitHub: https://github.com/intel-hadoop/HiBench
It lets you stress test a Hadoop cluster according to several usage profiles.
Micro Benchmarks
WordCount
This test distributes the counting of the number of words in a data source.
The data source is generated by a HiBench preparation script which relies on Hadoop's randomtextwriter.
This test belongs to a class of jobs which extract a small amount of information from a large data source.
It is a CPU-bound test.
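HiBench drives this through its preparation and run scripts, but the underlying jobs are the randomtextwriter and wordcount examples shipped with Hadoop. A minimal sketch on Hadoop 1.x (the jar name and the total_bytes property vary across versions, and the HDFS paths are just examples):

```bash
# Generate the random text data source (this is what the HiBench prepare script automates).
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomtextwriter \
  -D test.randomtextwrite.total_bytes=1073741824 \
  /benchmarks/wordcount/input

# Run the CPU-bound word counting job on the generated data.
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount \
  /benchmarks/wordcount/input /benchmarks/wordcount/output
```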
Sort
This test distributes the sorting of a data source.
The data source is generated by a preparation script which relies on Hadoop's randomtextwriter.
This test is the simplest one you can imagine. Indeed, both the Map and Reduce stages are identity functions; the sorting is done automatically during the Shuffle & Merge stage of MapReduce.
It is I/O bound.
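Here again, the HiBench scripts wrap the sort example shipped with Hadoop. A sketch, with the same assumptions as above (the -outKey/-outValue options tell the job to sort Text records, matching what randomtextwriter produces):

```bash
# Map and Reduce are identity functions; the Shuffle & Merge stage does
# the actual sorting, which is why this job is I/O bound.
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar sort \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  /benchmarks/sort/input /benchmarks/sort/output
```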
TeraSort
This test also distributes the sorting of a data source.
The data source is generated by the TeraGen job, which by default creates 1 billion 100-byte rows.
These rows are then sorted by TeraSort. Unlike Sort, TeraSort provides its own input and output formats, as well as its own Partitioner which ensures that the keys are evenly distributed among all nodes.
It is therefore an improved Sort which aims at spreading the load equally among all nodes during the test.
Because of this design, this test is:
- CPU bound for the Map stage
- I/O bound for the Reduce stage
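Outside of HiBench, the same test can be run directly with the examples shipped with Hadoop (the HDFS paths are examples):

```bash
# TeraGen creates the input: here 1 billion rows of 100 bytes each (~100 GB).
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen \
  1000000000 /benchmarks/terasort/input

# TeraSort samples the keys to build its custom Partitioner, so the sort
# load is spread evenly across the nodes.
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort \
  /benchmarks/terasort/input /benchmarks/terasort/output
```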
Enhanced DFSIO
This test is dedicated to HDFS. It aims at measuring the aggregate I/O rate and throughput of HDFS during reads and writes.
During its preparation stage, a data source is generated and put on HDFS.
Then, two tests are run:
- A read of the generated data source
- A write of a large amount of data
The write test is basically the same thing as the preparation stage.
This test is I/O bound.
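Enhanced DFSIO is an extension of the stock TestDFSIO tool shipped with Hadoop; if you just need a quick I/O baseline, the stock tool paints a similar, though less detailed, picture (a sketch on Hadoop 1.x):

```bash
# Write test: each of the 100 map tasks writes a 100 MB file to HDFS.
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO \
  -write -nrFiles 100 -fileSize 100

# Read test: the same files are read back; results are appended to
# TestDFSIO_results.log on the local filesystem.
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO \
  -read -nrFiles 100 -fileSize 100
```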
Web Search
Nutch indexing
This test focuses on the performance of the cluster when it comes to indexing data.
To do so, the preparation stage generates the data to be indexed.
Then, indexing is performed with Apache Nutch.
This test is I/O bound with high CPU utilization during the Map stage.
Page Rank
This test measures the performance of the cluster for PageRank jobs.
The preparation phase generates the data source in the form of a graph which can be processed using the PageRank algorithm.
Then, the actual ranking is computed by a chain of 6 MapReduce jobs.
This test is CPU bound.
Machine Learning
Naive Bayes Classifier
This test performs a probabilistic classification on a data source.
It is explained in depth on Wikipedia.
The preparation stage generates the data source.
Then, the test chains two MapReduce jobs with Mahout:
- seq2sparse transforms a text data source into vectors
- trainnb computes the final model using vectors
This test is I/O bound with high CPU utilization during the Map stage of the seq2sparse job.
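Roughly, this chain corresponds to the following Mahout commands (a sketch: the HDFS paths are examples and the exact options HiBench passes may differ):

```bash
# Turn the generated text corpus into TF-IDF vectors
# (high CPU utilization in the Map stage).
mahout seq2sparse \
  -i /benchmarks/bayes/seqfiles \
  -o /benchmarks/bayes/vectors

# Train the Naive Bayes model from the vectors; -el extracts the
# labels from the data, -li is where the label index is written.
mahout trainnb \
  -i /benchmarks/bayes/vectors/tfidf-vectors \
  -o /benchmarks/bayes/model \
  -li /benchmarks/bayes/labelindex \
  -el
```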
When using this test, we didn't observe a significant load on the cluster. It seems necessary either to provide your own data source or to greatly increase the size of the data generated during the preparation stage.
K-Means clustering
This test partitions a data source into several clusters where each element belongs to the cluster with the nearest mean.
It is explained in depth on Wikipedia.
The preparation stage generates the data source.
Then, the algorithm runs on this data source through Mahout.
The K-Means clustering algorithm is composed of two stages:
- iterations
- clustering
Each of these stages runs MapReduce jobs and has a specific usage profile (see the sketch after this list):
- CPU bound for iterations
- I/O bound for clustering
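With Mahout, both stages are driven by a single kmeans command. A sketch with example paths and arbitrary parameter values:

```bash
# -x caps the number of iterations (the CPU-bound stage); -cl runs the
# final clustering pass that assigns each point to a cluster (I/O bound).
# With -k, Mahout samples k random points as initial centroids into -c.
mahout kmeans \
  -i /benchmarks/kmeans/vectors \
  -c /benchmarks/kmeans/initial-centroids \
  -o /benchmarks/kmeans/output \
  -k 10 -x 5 -cl
```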
Analytical Query
This class of tests performs queries that correspond to the usage profile of business analysts and other database users.
The data source is generated during the preparation stage.
Two tables are created:
- A rankings table
- A uservisits table
This is a common schema that you will meet in many web applications.
Once the data source has been generated, two Hive queries are performed:
- A join
- An aggregation
These tests are I/O bound.
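The two queries look roughly like the following HiveQL. This is a sketch: the column names below follow the classic rankings/uservisits schema and are not copied from the HiBench sources:

```bash
# Join: correlate the rank of visited pages with the ad revenue they generated.
hive -e "
  SELECT sourceIP, AVG(pageRank), SUM(adRevenue)
  FROM rankings r JOIN uservisits uv ON (r.pageURL = uv.destURL)
  GROUP BY sourceIP;
"

# Aggregation: total ad revenue per source IP.
hive -e "
  SELECT sourceIP, SUM(adRevenue)
  FROM uservisits
  GROUP BY sourceIP;
"
```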
Using HiBench
Run a stress test
Running HiBench is not very hard:
- Retrieve the sources from GitHub
- Ensure that nobody is using the cluster
- Ensure that you have correctly configured your environment variables
Then, the file bin/hibench-config.sh contains all the options to tune before starting the stress test, including the HDFS directory where you want to write both source and result data, the path of the final report file on the local filesystem, ...
Once configured, ensure that the HDFS directory where you want to write your data source and your results exists on the cluster, then run the command bin/run-all.sh, as shown below. Now you can take a coffee... or two.
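Putting it all together, a typical session looks like this (a sketch; the exact variable names in hibench-config.sh depend on the HiBench version):

```bash
# Retrieve the sources.
git clone https://github.com/intel-hadoop/HiBench.git
cd HiBench

# Tune the global options: HDFS working directory, report file path, ...
vi bin/hibench-config.sh

# Make sure the HDFS working directory exists (the path is an example).
hadoop fs -mkdir /HiBench

# Launch all the benchmarks.
bin/run-all.sh
```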
Interpretation of the results
Results are written into the hibench.report file with the following CSV format:
test_name end_date <jobstart_timestamp,jobend_timestamp> size_in_bytes duration_ms throughput
Beware that the actual result file does not contain the column header above.
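Since the file has no header, a small one-liner makes it easier to read, for example by converting the throughput column from bytes/s to MB/s (a sketch which assumes the column layout described above):

```bash
# Print the test name, input size, duration and throughput in MB/s.
awk '{ printf "%-20s %15d bytes %10d ms %10.1f MB/s\n", $1, $4, $5, $6/1048576 }' hibench.report
```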
The DFSIOE test also writes a CSV and an interpretation of its results in its subdirectory dfsioe.
Limitations
Latest Hadoop support
Presently, HiBench runs on Hadoop 1.0. This means that the latest versions of the Cloudera or Hortonworks distributions, for example, won't be able to run all tests since they rely on Hadoop 2.
However, the effort necessary to support Hadoop 2 is not that big for the majority of the tests, since it is mainly a matter of updating configuration parameter names.
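For instance, a large share of the Hadoop 1 property names were simply deprecated and renamed in Hadoop 2 (two well-known renames are shown below; check the deprecated properties list of your distribution for the full mapping):

```bash
# Hadoop 1 property name   ->  Hadoop 2 property name
#   mapred.map.tasks       ->  mapreduce.job.maps
#   mapred.reduce.tasks    ->  mapreduce.job.reduces
```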
Also, HiBench alone is not enough for a good stress test report. You also need to retrieve the information provided by the JobTracker/ResourceManager, such as the mean execution time of the Map, Reduce, Shuffle and Merge phases of every job, in order to build an accurate final report.
A public benchmark repository, a big missing piece
HiBench tried to address this lack through its wiki page, which invites you to post your results, but with no success so far.
Building a public benchmark repository that provides a set of meaningful metrics against which to compare a cluster is still an unaddressed issue, but it would be interesting and quite useful.
What are the alternatives?
An alternative to HiBench exists, but it is focused on a more specific usage profile.
GridMix
GridMix is included in Hadoop, alongside the example jobs like TeraSort, Sort, ...
However, it generates MapReduce jobs which are focused on sorting large amounts of data and does not cover other profiles like Machine Learning.
Conclusion
In spite of these drawbacks, HiBench greatly simplifies the benchmarking of a Hadoop cluster.
In the future, this domain will certainly see new tools with more functionality and better coverage of different usage profiles. It is only the beginning.