Scribe: a way to aggregate data and, why not, to directly fill HDFS?
HDFS is a distributed file system, and it quickly raises a question: how do I fill this file system with all my data? There are several options, ranging from batch import to Straight Through Processing.
- Bulk load style. The first option is to keep collecting data on the local file system and import it in periodic batches. The second is to use an ETL tool. Pentaho has announced Hadoop support for its Data Integration product. The first tests we conducted lead us to think that it currently works much better for extracting data from Hadoop (using Hive) than for importing data into it. It is only a matter of time, though, before Pentaho fixes the few issues we encountered. The third is to use a solution like Sqoop. Sqoop extracts data from your RDBMS and imports it into Hadoop using Map/Reduce jobs. I hope we will be able to talk about that solution very soon.
- Straight Through Processing style. In that domain, you can look at solutions like Flume, Chukwa (which is part of the Apache Hadoop distribution) or Scribe. In brief, you collect and aggregate data in a more STP style, from different sources, different applications and different machines. Flume and Chukwa work broadly the same way as Scribe, but they provide more connectors, in the sense that you can, for instance, “tail” a log file. Chukwa is also much more easily integrated with the “Hadoop stack” than Scribe.
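
To make the simplest bulk load option concrete (collect files locally, then import them in batches), here is a minimal sketch. It is only an illustration under a few assumptions: the `hadoop` command is on the PATH, applications drop complete `*.log` files into a spool directory, and the function names and directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_put_command(local_files, hdfs_dir):
    """Build one `hadoop fs -put` call that copies a batch of local files to HDFS."""
    return ["hadoop", "fs", "-put", *[str(p) for p in local_files], hdfs_dir]


def import_batch(spool_dir, hdfs_dir, run=subprocess.run):
    """Upload every *.log file found in spool_dir, then delete the local copies.

    `run` is injectable so the function can be exercised without a cluster.
    """
    batch = sorted(Path(spool_dir).glob("*.log"))
    if not batch:
        return []  # nothing collected since the last run
    # check=True raises if the hadoop command fails, so the files stay
    # in the spool directory and are retried on the next run.
    run(build_put_command(batch, hdfs_dir), check=True)
    for path in batch:
        path.unlink()
    return batch
```

Scheduled from cron, something like `import_batch("/var/spool/myapp", "/data/incoming")` (both paths are placeholders) gives the batch behaviour described above. The price is latency: data only reaches HDFS at the next run, which is exactly what the STP-style tools try to avoid.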

