Hadoop in my IT department: How to plan a cluster?

OK, you have decided to set up a Hadoop cluster for your business.

The next step is planning the cluster. But Hadoop is a complex stack, and you might have many questions:

  • HDFS deals with replication and MapReduce creates files… How can I plan my storage needs?
  • How to plan my CPU needs?
  • How to plan my memory needs? Should I consider different needs on some nodes of the cluster?
  • I heard that MapReduce moves its job code to where the data to process is located… What does that imply in terms of network bandwidth?
  • At what point, and to what extent, should my planning take into account what the end users will actually process on the cluster?

That is what we try to clarify in this article, with explanations and formulas to help you estimate your needs as accurately as possible.
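To give a feel for the kind of formula involved in the storage question, here is a minimal sketch in Python. It is not the method from the full article: the function name `raw_storage_tb` and the `temp_fraction` parameter are hypothetical, and the 25% share reserved for intermediate MapReduce files is a placeholder to adjust for your own jobs. Only the replication factor of 3 reflects the usual HDFS default.

```python
def raw_storage_tb(data_tb, replication=3, temp_fraction=0.25):
    """Rough raw disk capacity (in TB) needed to hold `data_tb` of user data.

    replication   -- number of copies HDFS keeps of each block (3 by default)
    temp_fraction -- ASSUMED share of raw disk kept free for intermediate
                     MapReduce files; a placeholder, not a measured value
    """
    if not 0 <= temp_fraction < 1:
        raise ValueError("temp_fraction must be in [0, 1)")
    # Replicated data may only occupy the part of the disks that is not
    # reserved for temporary files, hence the division.
    return data_tb * replication / (1 - temp_fraction)
```

With these assumptions, 100 TB of user data would call for roughly 400 TB of raw disk across the cluster.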

(more…)

Android Testing :: testing private methods

This article is about testing private methods in Android. This is a fairly common problem in Android (and in Java at large) and can be solved easily. The technique proposed here has the additional benefit of relying on a traditional way of solving the problem in the Java world. (suspense :) )

On the Android platform, you are used to dividing your application into two projects:

  • one for the main source code of your application,
  • one for the tests

(more…)

Batman rises in Monte-Carlo

I had the chance, with Alexis Flaurimont, to speak about the usefulness of parallel programming at Breizh C@mp this year. One of the goals was to demonstrate that parallel programming is a lot easier than it was a couple of years ago.

During the presentation, we used the Monte Carlo method. It is, I must confess, an embarrassingly parallel algorithm: perfect for demonstrating that parallelization can greatly improve an application's performance.
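To illustrate why Monte Carlo is called embarrassingly parallel, here is a small sketch in Python that estimates π by drawing random points in the unit square. This is not the code from the talk: the function names, the chunking scheme and the choice of π as the target are all assumptions made for the example. Each chunk is computed independently from its own seed, so the chunks can be handed to any `map`-like function, sequential or parallel.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def hits_in_chunk(seed, samples):
    """Count random points of the unit square that fall inside the quarter disk."""
    rng = random.Random(seed)  # one private generator per chunk: no shared state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(samples_per_chunk, chunks, map_fn=map):
    """Estimate pi from `chunks` independent chunks, combined with a simple sum."""
    seeds = range(chunks)
    sizes = [samples_per_chunk] * chunks
    total_hits = sum(map_fn(hits_in_chunk, seeds, sizes))
    return 4.0 * total_hits / (samples_per_chunk * chunks)


def estimate_pi_parallel(samples_per_chunk, chunks):
    """Same computation, with the chunks spread over worker processes."""
    with ProcessPoolExecutor() as pool:
        return estimate_pi(samples_per_chunk, chunks, map_fn=pool.map)
```

The only difference between the sequential and the parallel version is the `map` function passed in, which is what makes this family of algorithms such a convenient demonstration.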

(more…)

iOS dev: How to get your code coverage right?

When I decided to tackle my previous blog article on quality metrics for iOS, I wasn’t prepared to spend that much time getting something robust and correct.

The part I stumbled on most was code coverage, not because it is that difficult to make it work (there are plenty of resources on the Web) but because, in all the articles I have seen, the solution worked but did not report accurate and useful metrics (I am sure I have missed some; my apologies if that is the case).

Note: the fact that Xcode support for it is very fragile and has changed with almost every version of Xcode did not help, and explains why people focused first on making it work.

Here are some of the pitfalls I saw in all the articles talking about code coverage:

  • Pitfall #1: only the files under test are reported in the code coverage report. It means you know the coverage of what you did test, but not of what you have not tested. This is the biggest pitfall in my opinion.
  • Pitfall #2: third-party libraries and test files skew the coverage figures
  • Pitfall #3: the report is not structured, so it is difficult to analyze and to make actionable
  • Pitfall #4: no article distinguishes between GHUnit and OCUnit, even though there are indeed some differences

The setup I proposed in the previous article is still valid and will help you avoid these pitfalls. As that article was already long enough, I decided not to make it longer and to keep all the detailed explanations for a new article. Here it is.

(Read more…)

iOS dev: How to set up quality metrics on your Jenkins job?

iOS development projects are not best-in-class when it comes to managing the quality of the software produced.

Very short projects, very short time-to-market: these are not the kind of projects where you see much attention paid to quality, unfortunately.

Here at OCTO we try to do it differently, even for this kind of project. Or above all for this kind, for that matter. But here comes another issue: the lack of tooling.

Here is one of our latest attempts to set up quality metrics on a short, budget-constrained iOS project. And yes, as a teaser, look at our dashboard at the end of this six-week (six Agile iterations) project:

Screenshot of Jenkins dashboard on an iOS project

The following article details how to set up all these quality metrics in an integrated report in Jenkins (Continuous Integration).

(Read more…)

Graph databases: an overview

In a previous article, we introduced a few concepts related to graphs, and illustrated them with two examples using the Neo4j graph database.

Over the past few years, many companies have been developing graph databases, either as software vendors like Neo Technology (Neo4j), Objectivity (InfiniteGraph), Sparsity (dex*), or by building their own custom solutions to integrate into their applications, like LinkedIn or Twitter.
As a result, it can be hard to get a global picture of this rich, continuously evolving landscape.
In this new article focused on graph databases, we will give you the elements you need to understand how they fit into the ecosystem, compared to other kinds of databases and to other types of graph processing tools.
Specifically, we will try to answer an important question: when to use a graph database and when not to use one. The answer is not that obvious.

(more…)

Introduction to large-scale graph processing

Graphs are very attractive when it comes to modelling real-world data, because they are intuitive, flexible (more than tables and rows in an RDBMS), and because the theory supporting them has been maturing for centuries. As a consequence, there are several graph databases available, Neo4j being one of the most renowned.

The same goes for graph processing: algorithms are numerous, well understood and have immediate applications, such as single-source shortest path, route finding, loop detection and subgraph matching, to name a few. Neo4j comes with a small collection of such algorithms built into its graphalgo package.

Problems arise when processing very large graphs, with billions of highly connected vertices. In such cases the graph can’t fit on a single machine, and the implementation resorts to a large batch job distributed over a cluster of machines. Algorithms typically follow the edges of the graph, so a naïve approach introduces significant overhead due to machine-to-machine communication (partitioning the graph optimally across the cluster amounts to the general graph partitioning problem, which is hard to solve without heuristics).

We will give an overview of BSP (Bulk Synchronous Parallel), a computation model often used for processing such large graphs. Then we will explore a list of products and frameworks dedicated to graph processing, either using BSP or based on other approaches.
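As a rough illustration of the BSP idea, the sketch below runs single-source shortest path in a vertex-centric style, simulated in a single Python process. It is only a toy under stated assumptions: the function name and the dictionary-based graph format are made up for the example, and a real framework would spread the vertices over many machines. What it does show is the rhythm of the model: in each superstep every vertex reads the messages it received, updates its state, sends new messages, and then everyone waits at a barrier.

```python
import math


def bsp_shortest_paths(edges, source):
    """Single-source shortest paths computed in BSP-style supersteps.

    edges  -- dict mapping every vertex to a list of (neighbor, weight) pairs
    source -- starting vertex
    Returns (distances, number_of_supersteps).
    """
    dist = {vertex: math.inf for vertex in edges}
    inbox = {source: [0]}  # messages to be delivered in the next superstep
    supersteps = 0
    while inbox:  # the computation halts when no message is in flight
        outbox = {}
        # Local computation: each vertex only looks at its own messages.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                # Communication: propose a new distance to each neighbor.
                for neighbor, weight in edges[vertex]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
        supersteps += 1
    return dist, supersteps
```

For example, `bsp_shortest_paths({"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}, "a")` first reaches `c` with distance 4, then improves it to 3 one superstep later, once the message routed through `b` arrives.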

(more…)

HTTP caching with Nginx and Memcached

Deploying an HTTP cache in front of web servers is a good way to improve performance. This post has two goals:

  • present the basics of HTTP Caching
  • present the new features I have implemented in the Memcached Nginx module to simplify HTTP caching

(more…)

The Esper CEP ecosystem

Mathieu’s introduction to Complex Event Processing (CEP) announced a series of articles on various CEP solutions. We begin this series with a post about Esper.

Esper, maintained by EsperTech, is a Java platform dedicated to complex event processing and event stream processing (ESP), that is, a collection of frameworks and tools that can be combined to build event-oriented applications and integrate them together.

Most of the foundation of such applications is provided by the three different Esper packages. These are advertised as “editions”, but are really complementary building blocks.

Building an event-oriented application with Esper involves:

  • coding the main application logic with event processing statements, using the core algorithmic engine Esper Event Stream and Complex Event Processing (Esper Engine for short). It is distributed as an open-source project under the GPLv2 license
  • packaging, integrating and deploying the application. A good candidate for these tasks is the dedicated Esper Enterprise Edition (EsperEE)
  • optionally securing the event processing logic with EsperHA, which brings persistence capabilities to Esper, thus enabling high-availability and recovery scenarios

This post thus offers an overview of the Esper ecosystem as a CEP platform. The article covers version 4.5.0 of the platform, although at the time of writing the latest released version is 4.6.0.

(more…)

Getting from shell to Puppet

After this (French) article about managing servers with shell scripts (what we were doing), and this (also French) one about tools for automated deployment (what we planned to do), including Puppet, here is the article about Puppet. By doing. With blood, tears, and victories ;)

Because yes, going from servers managed by shell scripts to Puppet, when you don’t know Puppet, is not so easy.

(more…)