Scribe : a way to aggregate data and why not, to directly fill the HDFS?

HDFS is a distributed file system and quickly raise an issue : how to fill this file system with all my data? There are several options that go from batch import to Straight Through Processing.

  • Bulk load style. The first one is to keep collecting data on local file system and importing them by vacation. The second one is to use an ETL. Pentaho has announced support of Hadoop for Data Integration product. The first tests we conducted lead us to think this works much better to extract data from Hadoop (using Hive) than to import data. Yet this is just a matter of time before Pentaho fixes the few issues we encountered. The third one is to use solution like Sqoop. Sqoop extracts (or imports) data from your RDBMS using Map/Reduce algorithm. I hope we will be able to talk about that solution very soon.
  • Straight Through Processing style. In that domain, you can look at solutions like Flume, Chukwa (which is part of Apache Hadoop distribution) or Scribe. In brief, you collect and agregate data in a more STP style, from different sources, different applications, different machines. They globally work the same way than Scribe but solutions like Flume or Chukwa provide more connectors than Scribe in a sense you can, for instance, “tail” a log file etc etc…Chukwa is also much easily integrated with the “Hadoop stack” than what Scribe could be.

Scribe : a kind of distributed collector

(more…)

Scribe installation

Scribe installation is a little bit tricky (I need to precise I am not what we can call a C++ compilation expert and thanks to David for his help…).
Here is so how I installed Scribe on my Ubuntu (Ubuntu 10.04 LTS – the Lucid Lynx – released in April 2010)
(more…)

Using Hadoop for Value At Risk calculation Part 6

In the first part, I described the potential interest of using Hadoop for Value At Risk calculation in order to analyze intermediate results. In the three (2,3, 4) next parts I have detailled how to implement the VAR calculation with Hadoop. Then in the fifth part, I have studied how to analyse the intermediate results with Hive. I will finally give you now some performance figures on Hadoop and compare them with GridGain ones. According to those figures, I will detail some performance key points by using such kind of tools. Finally, I will conclude why Hadoop is an interesting choice for such kind of computation.
(more…)

Using Hadoop for Value At Risk calculation Part 5

In the first part of this series, I have introduced why Hadoop framework could be useful to compute the VAR and analyze intermediate values. In the second part and third part and fourth part I have given two concrete implementations of VAR calculation with Hadoop with optimizations. Another interest of using Hadoop for Value At Risk calculation is the ability to analyse the intermediate values inside Hadoop through Hive. This is the goal of this (smaller) part of this series.
(more…)

Using Hadoop for Value At Risk calculation Part 4

In the first part of this series, I have introduced why Hadoop framework could be useful to compute the VAR and analyze intermediate values. In the second part and in the third part I have given two concrete implementations of VAR calculation with Hadoop. I will now give you some details about the optimizations used in those implementations.
(more…)

Using Hadoop for Value At Risk calculation Part 3

In the first part of this series, I have introduced why Hadoop framework could be useful to compute the VAR and analyze intermediate values. In the second part I have described a first implementation. One drawback of this previous implementation is that it does not take advantage of the reduce pattern. I did it by hand. I will now fully use Hadoop reduce feature.

(more…)

Using Hadoop for Value At Risk calculation Part 2

In the first part of this series, I have introduced why Hadoop framework could be useful to compute the VAR and analyze intermediate values. In that second part I will give you a first concrete implementation of the VAR calculation with Hadoop.

(more…)

Using Hadoop for Value At Risk calculation Part 1

After introducing the Value at Risk in my first article, I have implemented it using GridGain in my second article. I conclude in this latter that relatively good performances have been reached through some optimizations. One of them was based on the hypothesis that the intermediate results – the prices for each draw – can be discarded. However, it is not always the case. Keeping the generated parameters and the call price for each draw can be very useful for business in order to analyze the influence of the different parameters. Such data are often crunched by Business Intelligence. VAR calculation is maybe not the best business use case in order to illustrate that need but I will reuse it as it has already been defined. The purpose of this new series of articles will be to compute the Value at Risk and keep all results, to be able to analyze them.

  • In this first part, I will describe how to persist that data both with GridGain and with Hadoop;
  • In the three next parts I will go into details of different implementations with Hadoop. These parts provide interesting code examples but can be safely discard in first read;
  • Then in the fifth part, I will show how to use Hadoop to do Business Intelligence on the data;
  • And in the last part, I will give some performance figures and improvement possibilities.

(more…)

How to “crunch” your data stored in HDFS?

HDFS stores huge amount of data but storing it is worthless if you cannot analyse it and obtain information.

Option #1 : Hadoop : the Map/Reduce engine

Hadoop Overview

Hadoop is a Map/Reduce framework that works on HDFS or on HBase. The main idea is to decompose a job into several and identical tasks that can be executed closer to the data (on the DataNode). In addition, each task is parallelized : the Map phase. Then all these intermediate results are merged into one result : the Reduce phase.

In Hadoop, The JobTracker (a java process) is responsible for monitoring the job, (more…)

Hadoop Distributed File System : Overview & Configuration

Hadoop Distributed File System can be considered as a standard file system butt it is distributed. So from the client point of view, he sees a standard file system (the one he can have on your laptop) but behind this, the file system actually runs on several machines. Thus, HDFS implements fail-over using data replication and has been designed to manipulate, store large data sets (in large file) in a write-one-read-many access model for files.
(more…)