Map/Reduce

Big Data

Introduction to Datastax Brisk: a Hadoop and Cassandra distribution

As the Apache Hadoop ecosystem grows and its core matures, several companies now provide business-class Hadoop distributions and services. While EMC, after acquiring Greenplum, seems to be the biggest player, other companies such as Cloudera or MapR are also competing. This article introduces Datastax Brisk, an innovative Hadoop distribution that leverages the Apache Hive data warehouse infrastructure on top of an HDFS-compatible storage layer based on Cassandra. Brisk tries to reconcile real-time applications with low-latency requirements (OLTP) and big data analytics (OLAP) in one system…

Read more
Archi & Techno

My reading of the Percolator architecture: a Google search engine component

In April 2010, Google updated its indexing system. Caffeine - the name of this project - was largely invisible to the general public but represents an in-depth change for Google. It does not directly improve the search page, as Instant Search does, but rather the indexing mechanism, i.e. the way relevant search results are produced. For the end user, this change reduces the delay between the moment a page is found and the moment it is made available in Google search results. Google has recently published a research…

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 6

In the first part, I described the potential interest of using Hadoop for Value At Risk calculation in order to analyze intermediate results. In the next three parts (2, 3, 4), I detailed how to implement the VAR calculation with Hadoop. Then, in the fifth part, I studied how to analyze the intermediate results with Hive. I will now give you some performance figures for Hadoop and compare them with GridGain's. Based on those figures, I will detail some key performance points…

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 4

In the first part of this series, I introduced why the Hadoop framework could be useful to compute the VAR and analyze intermediate values. In the second and third parts, I gave two concrete implementations of the VAR calculation with Hadoop. I will now give you some details about the optimizations used in those implementations.

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 3

In the first part of this series, I introduced why the Hadoop framework could be useful to compute the VAR and analyze intermediate values. In the second part, I described a first implementation. One drawback of that implementation is that it does not take advantage of the reduce pattern: I performed the reduction by hand. I will now make full use of Hadoop's reduce feature.

Read more
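To give an idea of what using Hadoop's reduce feature can look like for this problem, here is a minimal sketch (not the article's actual code) of a Reducer that extracts the VAR as a percentile of the simulated prices. The class name, key scheme and confidence level are assumptions for illustration.

```java
// Hypothetical sketch: a Reducer computing the VAR as a percentile of the
// simulated prices. The key identifies a scenario; the values are the
// prices of each draw, emitted by the mapper.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VarReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  // Assumed confidence level: the VAR at 99% is the 1% worst price.
  private static final double PERCENTILE = 0.01;

  @Override
  protected void reduce(Text scenario, Iterable<DoubleWritable> prices, Context context)
      throws IOException, InterruptedException {
    // Collect the simulated prices for this scenario. A real job would use
    // a bounded structure (e.g. a heap of the k smallest values) instead of
    // buffering every value in memory.
    List<Double> sorted = new ArrayList<>();
    for (DoubleWritable price : prices) {
      sorted.add(price.get());
    }
    Collections.sort(sorted);
    int index = (int) Math.floor(PERCENTILE * (sorted.size() - 1));
    context.write(scenario, new DoubleWritable(sorted.get(index)));
  }
}
```

With this shape, Hadoop performs the grouping and shuffling of the draws by scenario, so the manual aggregation step of the previous implementation disappears.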
Archi & Techno

Using Hadoop for Value At Risk calculation Part 1

After introducing the Value at Risk in my first article, I implemented it using GridGain in my second article. I concluded in the latter that relatively good performance had been reached through some optimizations. One of them was based on the hypothesis that the intermediate results - the prices for each draw - can be discarded. However, that is not always the case. Keeping the generated parameters and the call price for each draw can be very useful for the business in order to analyze…

Read more
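To make the idea concrete, here is a minimal, self-contained sketch of a Monte Carlo VAR computation in which the intermediate prices are kept rather than discarded. The pricing function and all parameters are hypothetical stand-ins, not the article's implementation.

```java
// Hypothetical sketch: keep every simulated price so the VAR can be computed
// *and* the intermediate results stay available for later analysis.
import java.util.Arrays;
import java.util.Random;

public class MonteCarloVar {

  public static void main(String[] args) {
    int draws = 100_000;
    double[] prices = new double[draws];
    Random random = new Random(42);

    // Monte Carlo simulation: one (dummy) call price per draw.
    for (int i = 0; i < draws; i++) {
      prices[i] = callPrice(random.nextGaussian());
    }

    // VAR at 99%: the price below which only 1% of the draws fall.
    double[] sorted = prices.clone();
    Arrays.sort(sorted);
    double var99 = sorted[(int) Math.floor(0.01 * (draws - 1))];
    System.out.println("VAR(99%) = " + var99);

    // Because the prices array was kept, each draw can now be inspected,
    // exported or aggregated for further business analysis.
  }

  // Dummy stand-in for a real pricer.
  private static double callPrice(double gaussianDraw) {
    return Math.max(0.0, 100.0 * Math.exp(0.2 * gaussianDraw) - 100.0);
  }
}
```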
Archi & Techno

How to “crunch” your data stored in HDFS?

HDFS stores huge amounts of data, but storing it is worthless if you cannot analyze it and extract information from it. Option #1: Hadoop, the Map/Reduce engine. Hadoop is a Map/Reduce framework that works on top of HDFS or HBase. The main idea is to decompose a job into several identical tasks that can be executed closer to the data (on the DataNodes). In addition, these tasks run in parallel: this is the Map phase. Then all the intermediate results are merged into one result…

Read more
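As an illustration of the Map and Reduce phases described above, here is a minimal sketch of the classic word-count job using Hadoop's org.apache.hadoop.mapreduce API. It is the textbook example, not code from the article; the class name and input/output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each task reads a split of the input, close to the data,
  // and emits one (word, 1) pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: all intermediate counts for the same word are merged
  // into a single total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that the reducer is also registered as a combiner, so partial sums are merged on each DataNode before the shuffle, which reduces the volume of intermediate data sent over the network.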