Hadoop

Archi & Techno

Big data: some myths

At my hairdresser’s, on the coffee table, I came across one of those trendy men’s magazines with a model on the cover and the promise of teaching you how to avoid 10 common mistakes when wearing a tie. I happened to open it at page 34: "The Big Data revolution."

Read more
Archi & Techno

Hadoop 2 stable release is coming and why you should care

The first beta version of Hadoop 2 has just been released: version 2.1.0. More interestingly, the stable version is expected to follow by mid-September. We are only a few bugs away from Hadoop 2! That's good news, but you might be wondering why you should care. After all, what matters most is what this new major version can bring to your next datalab or to your production cluster, isn't it? In this article, we will cover the differences between Hadoop 1 and…

Read more
Archi & Techno

Hadoop in my IT department: benchmark your cluster

Stress testing is a very important step when you go live. Good stress tests help us to:

- ensure that the software meets its performance requirements
- ensure that the service will deliver a fast response time, even under a heavy load
- get to know the scalability limits, which in turn is useful to plan the next steps of the development

Hadoop is not a web application, a database or a web service. You don't stress test a Hadoop job with a heavy load. Instead, you need…
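
One common way to exercise a cluster end to end is the TeraGen/TeraSort pair shipped with the Hadoop examples. Here is a minimal sketch, assuming a Hadoop 1.x distribution with the examples jar on the classpath; the row count and HDFS paths are illustrative assumptions, not figures from the article:

import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical benchmark driver: generate sample rows with TeraGen,
// then time a TeraSort run over them.
public class ClusterBenchmark {
    public static void main(String[] args) throws Exception {
        // 10 million 100-byte rows written to HDFS (illustrative volume)
        ToolRunner.run(new TeraGen(), new String[] {"10000000", "/benchmarks/teragen"});

        long start = System.currentTimeMillis();
        int rc = ToolRunner.run(new TeraSort(),
                new String[] {"/benchmarks/teragen", "/benchmarks/terasort"});
        System.out.println("TeraSort exit code " + rc + ", took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}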

Read more
Archi & Techno

Hadoop in my IT department: How to plan a cluster?

OK, you have decided to set up a Hadoop cluster for your business. The next step is planning the cluster… But Hadoop is a complex stack and you might have many questions: HDFS deals with replication and MapReduce creates files…

- How can I plan my storage needs?
- How can I plan my CPU needs?
- How can I plan my memory needs?
- Should I consider different needs on some nodes of the cluster?
- I heard that MapReduce moves its job code to where the data to process is located…
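
For the storage question alone, a back-of-the-envelope calculation already goes a long way. Here is a minimal sketch, assuming the HDFS default replication factor of 3 and, as a purely illustrative figure, a quarter of the raw capacity reserved for MapReduce intermediate files:

// Rough raw-capacity estimate for an HDFS cluster.
// All input figures are illustrative assumptions, not recommendations.
public class StoragePlan {
    public static void main(String[] args) {
        double dataTb = 10.0;        // useful data to store, in TB
        int replication = 3;         // HDFS default dfs.replication
        double tempShare = 0.25;     // raw capacity kept for MapReduce spill files

        // Each block is stored 'replication' times, and part of the raw
        // capacity must stay free for intermediate MapReduce output.
        double rawTb = dataTb * replication / (1 - tempShare);
        System.out.printf("Raw disk needed: %.1f TB%n", rawTb); // 40.0 TB
    }
}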

Read more
Big Data

Introduction to Datastax Brisk: a Hadoop and Cassandra distribution

As the Apache Hadoop ecosystem grows while its core matures, there are now several companies providing business-class Hadoop distributions and services. While EMC, after acquiring Greenplum, seems the biggest player, other companies such as Cloudera or MapR are also competing. This article introduces Datastax Brisk, an innovative Hadoop distribution that leverages the Apache Hive data warehouse infrastructure on top of an HDFS-compatible storage layer based on Cassandra. Brisk tries to reconcile real-time applications with low-latency requirements (OLTP) and big data analytics (OLAP) in one system…

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 6

In the first part, I described the potential interest of using Hadoop for Value At Risk calculation in order to analyze intermediate results. In the next three parts (2, 3, 4), I detailed how to implement the VAR calculation with Hadoop. Then, in the fifth part, I studied how to analyze the intermediate results with Hive. I will now finally give you some performance figures for Hadoop and compare them with GridGain's. Based on those figures, I will detail some key performance points…

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 5

In the first part of this series, I introduced why the Hadoop framework could be useful for computing the VAR and analyzing intermediate values. In the second, third and fourth parts, I gave two concrete implementations of the VAR calculation with Hadoop, along with optimizations. Another benefit of using Hadoop for Value At Risk calculation is the ability to analyze the intermediate values inside Hadoop through Hive. That is the goal of this (smaller) part of the series.
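
As a taste of what such an analysis can look like, here is a minimal sketch that queries intermediate values from Java through the Hive JDBC driver of that era. The table and column names, like the connection details, are hypothetical placeholders rather than the article's actual schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: inspect intermediate VAR values through Hive's
// (pre-HiveServer2) JDBC driver. Table, columns and URL are
// hypothetical placeholders.
public class HiveAnalysis {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // e.g. the worst simulated price per scenario
        ResultSet rs = stmt.executeQuery(
                "SELECT scenario, MIN(price) FROM intermediate_results GROUP BY scenario");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
    }
}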

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 4

In the first part of this series, I introduced why the Hadoop framework could be useful for computing the VAR and analyzing intermediate values. In the second and third parts, I gave two concrete implementations of the VAR calculation with Hadoop. I will now give you some details about the optimizations used in those implementations.

Read more
Archi & Techno

Using Hadoop for Value At Risk calculation Part 3

In the first part of this series, I introduced why the Hadoop framework could be useful for computing the VAR and analyzing intermediate values. In the second part, I described a first implementation. One drawback of that implementation is that it does not take advantage of the reduce pattern: I did the reduction by hand. I will now make full use of Hadoop's reduce feature.
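
To make the distinction concrete, here is a minimal sketch of what letting Hadoop drive the reduction means with the old org.apache.hadoop.mapred API of that time: the framework groups every simulated value under its key and hands them to the reducer, so no hand-written aggregation loop over the whole dataset is needed. The key/value types and the min-based aggregation are illustrative assumptions, not the article's actual code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer (old mapred API): Hadoop groups the simulated
// values by scenario key; here we simply keep the worst (minimum) value
// per scenario as a stand-in for the real percentile computation.
public class WorstCaseReducer extends MapReduceBase
        implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    public void reduce(Text scenario, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> output,
                       Reporter reporter) throws IOException {
        double worst = Double.MAX_VALUE;
        while (values.hasNext()) {
            worst = Math.min(worst, values.next().get());
        }
        output.collect(scenario, new DoubleWritable(worst));
    }
}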

Read more