Hadoop in my IT department: benchmark your cluster

07/01/2013 by Rémy Saissy
Tags: Software Engineering

Stress testing is a very important step before going live.

Good stress tests help us to:

  • ensure that the software meets its performance requirements
  • ensure that the service will deliver fast response times even under heavy load
  • get to know the scalability limits, which in turn is useful for planning the next steps of development

Hadoop is not a web application, a database or a webservice. You don't stress test a Hadoop job with a heavy load. Instead, you need to benchmark the cluster, which means assessing its performance by running a variety of jobs, each focused on a specific field (indexing, querying, predictive statistics, machine learning, ...).

Intel has released HiBench, a tool dedicated to running such benchmarks. In this article, we will talk about this tool.

"Lire la suite…"

What is HiBench?

HiBench is a collection of shell scripts published under the Apache License 2.0 on GitHub: https://github.com/intel-hadoop/HiBench

It allows you to stress test a Hadoop cluster according to several usage profiles.

Micro Benchmarks

WordCount

This test distributes the counting of the words contained in a data source.

The data source is generated by a HiBench preparation script which relies on Hadoop's randomtextwriter.

This test belongs to a class of jobs which extracts a small amount of information from a large data source.

It is a CPU bound test.
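Outside of HiBench, you can reproduce this profile with the example jobs shipped with Hadoop. A minimal sketch, assuming the standard examples jar (its exact name and location depend on your distribution):

    # Generate random text with Hadoop's randomtextwriter, then count the words.
    # The jar name below is an assumption; adjust it to your distribution.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomtextwriter /benchmarks/wordcount/input
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /benchmarks/wordcount/input /benchmarks/wordcount/output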

Sort

This test distributes the sorting of a data source.

The data source is generated by a preparation script which relies on Hadoop's randomtextwriter.

This test is the simplest one you can imagine. Indeed, both the Map and Reduce stages are identity functions. The sorting is done automatically during the Shuffle & Merge stage of MapReduce.

It is I/O bound.
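The equivalent standalone run uses the same generator together with the sort example job; a sketch, where the -outKey/-outValue flags are needed so that Sort handles the Text records produced by randomtextwriter:

    # Identity Map and Reduce; the actual sorting happens during Shuffle & Merge.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomtextwriter /benchmarks/sort/input
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar sort \
        -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text \
        /benchmarks/sort/input /benchmarks/sort/output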

TeraSort

This test also distributes the sorting of a data source.

The data source is generated by the TeraGen job, which by default creates 1 billion 100-byte lines.

These lines are then sorted by TeraSort. Unlike Sort, TeraSort provides its own input and output formats, as well as its own Partitioner, which ensures that the keys are evenly distributed among all nodes.

Therefore, it is an improved Sort which aims at spreading the load equally across all nodes during the test.

Given this design, this test is:

  • CPU bound for the Map stage
  • I/O bound for the Reduce stage
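The standalone equivalent, as shipped with Hadoop (1 billion rows of 100 bytes is roughly 100 GB, matching the default mentioned above):

    # TeraGen takes the number of 100-byte rows to generate.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000000 /benchmarks/terasort/input
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort /benchmarks/terasort/input /benchmarks/terasort/output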

Enhanced DFSIO

This test is dedicated to HDFS. It aims at measuring the aggregated I/O rate and throughput of HDFS during reads and writes.

During its preparation stage, a data source is generated and put on HDFS.

Then, two tests are run:

  • A read of the generated data source
  • A write of a large amount of data

The write test is basically the same thing as the preparation stage.

This test is I/O bound.
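Enhanced DFSIO is HiBench's extension of the stock TestDFSIO job, which you can also run directly to get a comparable I/O profile; a sketch, assuming the usual test jar (its name is distribution-dependent):

    # Write, then read back, 16 files of 1000 MB each; -clean removes the test data.
    hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
    hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 16 -fileSize 1000
    hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -clean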

Web Search

Nutch indexing

This test focuses on the performance of the cluster when it comes to indexing data.

To do so, the preparation stage generates the data to be indexed.

Then, indexing is performed with Apache Nutch.

This test is I/O bound with high CPU utilization during the Map stage.

Page Rank

This test measures the performance of the cluster on PageRank jobs.

The preparation phase generates the data source in the form of a graph which can be processed using the PageRank algorithm.

Then, the PageRank computation itself is performed by a chain of six MapReduce jobs.

This test is CPU bound.

Machine Learning

Naive Bayes Classifier

This test performs a probabilistic classification on a data source.

It is explained in depth on Wikipedia.

The preparation stage generates the data source.

Then, the test chains two MapReduce jobs with Mahout:

  • seq2sparse transforms a text data source into vectors
  • trainnb computes the final model using vectors

This test is I/O bound with high CPU utilization during the Map stage of seq2sparse.

When using this test, we didn't observe a real load on the cluster. It looks like it is necessary either to provide your own data source or to greatly increase the size of the data generated during the preparation stage.
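For reference, the Mahout chain looks roughly like this; the flags are those of the Mahout 0.x CLI and may differ in your version, and the paths are only illustrative:

    # seq2sparse: turn text sequence files into TF-IDF vectors.
    mahout seq2sparse -i /benchmarks/bayes/text -o /benchmarks/bayes/vectors
    # trainnb: train the Naive Bayes model (-el extracts the labels from the input).
    mahout trainnb -i /benchmarks/bayes/vectors/tfidf-vectors \
        -o /benchmarks/bayes/model -li /benchmarks/bayes/labelindex -el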

K-Means clustering

This test partitions a data source into several clusters where each element belongs to the cluster with the nearest mean.

It is explained in depth on Wikipedia.

The preparation stage generates the data source.

Then, the algorithm runs on this data source through Mahout.

The K-Means clustering algorithm is composed of two stages:

  • iterations
  • clustering

Each of these stages runs MapReduce jobs and has a specific usage profile.

  • CPU bound for iterations
  • I/O bound for clustering
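As a sketch, a direct Mahout run exposes both stages (paths are illustrative and the flags follow the Mahout 0.x CLI):

    # -k seeds 10 random initial centers, -x caps the iteration stage at 20 passes,
    # -cl runs the final clustering stage once the iterations are done.
    mahout kmeans \
        -i /benchmarks/kmeans/vectors \
        -c /benchmarks/kmeans/initial-centers \
        -o /benchmarks/kmeans/output \
        -k 10 -x 20 -cl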

Analytical Query

This class of tests performs queries that correspond to the usage profile of business analysts and other database users.

The data source is generated during the preparation stage.

Two tables are created:

  • rankings table
  • uservisits table

This is a common schema that can be found in many web applications.

Once the data source has been generated, two Hive queries are performed:

  • A join
  • An aggregation

These tests are I/O bound.
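The two queries are in the spirit of the following HiveQL sketch; the actual HiBench queries may differ, and the column names below (from the classic rankings/uservisits benchmark schema) are assumptions:

    # An aggregation: total ad revenue per source IP.
    hive -e "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;"

    # A join between the two tables, correlating page ranks with ad revenue.
    hive -e "SELECT uv.sourceIP, AVG(r.pageRank), SUM(uv.adRevenue)
             FROM rankings r JOIN uservisits uv ON (r.pageURL = uv.destURL)
             GROUP BY uv.sourceIP;"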

Using HiBench

Run a stress test

Running HiBench is not very hard:

  • Retrieve the sources from GitHub
  • Ensure that nobody is using the cluster
  • Ensure that you have correctly configured your environment variables

Then, the file bin/hibench-config.sh contains all the options to tune before starting the stress test, including the HDFS directory where you want to write both source and result data, the path of the final report file on the local filesystem, ...
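For instance, the variables look like this; the exact names may vary between HiBench versions, so treat them as assumptions and check your copy of the script:

    # Excerpt from bin/hibench-config.sh (variable names may differ in your version).
    export HADOOP_HOME=/usr/lib/hadoop                    # adjust to your installation
    export DATA_HDFS=/benchmarks/hibench                  # HDFS directory for source and result data
    export HIBENCH_REPORT=${HIBENCH_HOME}/hibench.report  # report file on the local filesystem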

Once configured, ensure that the HDFS directory where you want to write your data source and your results exists on the cluster, then run the command bin/run-all.sh, as shown below. Now you can take a coffee... or two.
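In other words, something like:

    # Create the target directory on HDFS (it must match the directory configured above),
    # then launch the whole benchmark suite.
    hadoop fs -mkdir /benchmarks/hibench
    bin/run-all.sh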

Interpretation of the results

Results are written to the hibench.report file, one line per test, in the following format:

test_name end_date <job_start_timestamp,job_end_timestamp> size_in_bytes duration_ms throughput

Beware that the actual result file does not contain the column header above.
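Assuming the six space-separated fields above, a one-liner such as the following gives a quick overview of the report:

    # Print test name, input size, duration and throughput for each run.
    awk '{ printf "%-25s %15s bytes %10s ms %15s\n", $1, $4, $5, $6 }' hibench.report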

The DFSIOE test also writes a CSV and an interpretation of its results in its subdirectory dfsioe.

Limitations

Latest Hadoop support

Presently, HiBench runs on Hadoop 1.0. This means that the latest versions of the Cloudera or Hortonworks distributions, for example, won't be able to run all the tests since they rely on Hadoop 2.

However, the effort necessary to support Hadoop 2 is not that big for the majority of the tests since it is mainly a matter of updating configuration parameter names.

Also, HiBench alone is not enough to produce a good stress test report. You also need to retrieve the information provided by the JobTracker/ResourceManager, such as the mean execution time of the Map, Reduce, Shuffle and Merge phases of every job, in order to build an accurate final report.

The lack of a public benchmark repository

This is a gap that HiBench has tried to address through its wiki page, which invites you to post your results, but so far without success.

Building a public benchmark repository to provide a set of meaningful metrics for comparing clusters is still an open issue, but it would be interesting and quite useful.

What are the alternatives?

An alternative to HiBench exists, but it is more focused on a specific usage profile.

GridMix

GridMix is included in Hadoop alongside the example jobs like TeraSort, Sort, ...

However, it generates MapReduce jobs focused on sorting large amounts of data and does not cover other profiles such as machine learning.

Conclusion

In spite of these drawbacks, HiBench greatly simplifies the benchmarking of a Hadoop cluster.

In the future, this domain will certainly see new tools with more functionality and better coverage of the different usage profiles. It is only the beginning.