Big Data

Visualizing massive data streams: a public transport use case

Public transport companies release more data every day, and some are even opening their information systems to real-time streaming (Swiss transport, TPG in Geneva and RATP in Paris are a few local examples). Vast territories are opening up for technical experimentation!

Besides real-time data, these companies also publish their full schedules. In Switzerland, these cover trains, buses, tramways, boats and even gondolas.

In this post, we walk through an application built to visualize, in fast motion, one day of activity, as shown in this movie. As real-time data are not yet available, they were simulated from the available schedule information. The pretext is too good not to dig into a stack combining Play/Scala/Akka on the backend and Angular2/Pixi.js/D3.js/topojson in the browser, linked together by Server-Sent Events.
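To make the simulation idea concrete, here is a minimal sketch of how a vehicle position could be derived from a schedule and pushed to the browser. It is not the application's code (the real backend is written in Scala): the `StopTime` type, the `interpolate_position` and `to_sse` helpers and the `position` event name are all hypothetical, and straight-line interpolation between two stops is an assumption made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StopTime:
    """One scheduled stop (hypothetical type): position and seconds since midnight."""
    lat: float
    lon: float
    time_s: int


def interpolate_position(prev: StopTime, nxt: StopTime, now_s: float) -> Tuple[float, float]:
    """Linearly interpolate a vehicle position between two scheduled stops."""
    if nxt.time_s <= prev.time_s:
        # Degenerate segment: stay at the previous stop.
        return (prev.lat, prev.lon)
    # Clamp the progress ratio so the vehicle never leaves the segment.
    ratio = min(1.0, max(0.0, (now_s - prev.time_s) / (nxt.time_s - prev.time_s)))
    return (prev.lat + ratio * (nxt.lat - prev.lat),
            prev.lon + ratio * (nxt.lon - prev.lon))


def to_sse(event: str, payload: dict) -> str:
    """Serialize a payload as one Server-Sent Events frame (event + data + blank line)."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    departure = StopTime(46.0, 6.0, 8 * 3600)        # 08:00:00
    arrival = StopTime(46.5, 7.0, 8 * 3600 + 600)    # 08:10:00
    lat, lon = interpolate_position(departure, arrival, 8 * 3600 + 300)
    print(to_sse("position", {"lat": lat, "lon": lon}), end="")
```

Running the simulated clock faster than wall time and emitting one such frame per vehicle per tick would give the "fast motion" effect; a real implementation would more likely follow the route geometry than a straight line.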

This prototype is intended to explore the possibility of doing massive geographical visualization in the browser, applying techniques described in a previous post.

The backend and frontend code is available on GitHub, and tests run continuously on Travis CI.

Read more

Big Data

A quick summary and some thoughts on the Scikit-learn workshop

On December 2nd, the workshop “Using Scikit-learn and Scientific Python at Scale” was held at Telecom ParisTech, with top contributors to the project as speakers. The workshop was divided into four talks:

  1.    Scikit-learn for industrial applications, basic research and mind reading – Alexandre Gramfort
  2.    Distributed computing for predictive modeling in Python – Olivier Grisel
  3.    Scikit-learn at scale: out-of-core methods – Thierry Guillemot
  4.    An Industrial application at Airbus Group – Vincent Feuillard

Scikit-learn is currently the most widely used open source library for machine learning applications. It is developed in Python (with Cython and C/C++) and, with over 1000 pages of documentation, has become a major contribution to democratizing machine learning for a large audience.

Read more

Big Data

D3.js transitions killed my CPU! A d3.js & pixi.js comparison

D3.js is certainly the most versatile JavaScript data rendering library available: turning data into mind-blowing visualizations is limited only by your imagination. Its powerful selection transitions are a key component for turning static pages into animated ones. However, too many simultaneous transitions on a web page will soon bring your CPU to its knees.
Hence this blog post.

We faced this problem when displaying Swiss transport real-time data on a map, within an SVG layout: rendering was lagging, event-sourced data were not consumed consistently and laptop batteries were draining at a dramatic speed. A video of a first attempt can be seen and compared to a newer implementation using the technique presented in this article. Another surprise came from rendering a simple clock, which burned 20% of CPU with a single transition.

Although d3.js has no serious competitors for many rendering problems, we decided to try pixi.js, a JavaScript library used for building games that leverages the strengths of HTML5 and the GPU.

First, we compare the two libraries in terms of rendering performance. For the sake of completeness, we also discuss native CSS transitions. We then dive into a couple of tricks to enhance dynamic visualizations with each of the two libraries, and even combine them to get the best of both worlds.

The project source code and benchmark data are hosted on GitHub and a demo is available on

Read more

Big Data

A Journey into Industrializing the Writing and Deployment of Kibana Plugins (riding Docker)

by Alexandre Masselot (OCTO Technology Switzerland), Catherine Zwahlen (OCTO Technology Switzerland) and Jonathan Gianfreda.

The possibility of custom plugins is a strong Kibana promise. We propose an end-to-end tutorial on writing such plugins. But this “end-to-end” approach also raises questions such as “how to continuously deploy them?” and “how to share an environment with seeded data?” Those questions will lead us to a full-fledged integration infrastructure, backed by Docker.

Elasticsearch has grown from a Lucene evolution into a full-fledged distributed document store, with powerful storage, search and aggregation capabilities. Kibana has brought a strong component for interactive searching and visualization, turning the data storage tier into something an end user can browse.

Customizable dashboards built from a rich library of graphical components made its success, but the need for real customization soon arose. Although plugins were meant to be integrated from early on, actual customization often lay in forking the master project and adapting it to one particular purpose. Merging fixes back soon became a daunting effort, given the high pace at which the GitHub repository evolves.

Fortunately, as of version 4.3, the Kibana project took a more structured approach to integrating custom plugins. The promise of maintainable external plugins came true. Those plugins, written in JavaScript, can be as simple as a standalone widget (e.g. a clock), a field formatter (an up/down arrow instead of a positive/negative number), a graphical representation of a search result (a chart) or a full-blown application.

So, that should be easy. Just google it and you will craft wonderful, shiny visualizations.

But not so fast, young Kibana Padawan! Documentation is lacking; resources are valuable but scarce. Still, the promise is shiny and we want to reach it.

In this post, we share our journey into writing Kibana plugins, the little pitfalls we fell into and the setup of continuous deployment into a Docker environment. There is no dramatic discovery or stunning breakthrough today, only an attempt to draw a map that makes your journey easier.

Read more

Big Data

A chat with Doug Cutting about Hadoop

We had the chance to interview Doug Cutting during the Cloudera Sessions in Paris in October 2014. Doug is the creator of Hadoop and Cloudera’s Chief Architect. Here is our exchange:


A question is: how does it feel to see that Hadoop is actually becoming the must have, the default way of storing and computing over data in large enterprise companies?

Rationally it feels very good. It’s a technology that’s supposed to do that. Emotionally it’s very satisfying, but also I must say I must be very lucky. I was in the right place at the right time and happened to be the person. Someone else would have done this had I not, by now.


It’s funny because yesterday you were mentioning how Google released that paper about GFS and then about MapReduce, and you seemed surprised that no one else has gone and implemented the paper. How would you describe this, because it was a very big, big task that some people were daunted by taking on or…?

I think, again, I have the right experience from having put some work in open source. I worked on search engines and I could see the value in the technology, I understood the problem, and that combination. And I think I’ve also been in the software business long enough so that’s why I knew what it’d take to build a project that would be useful, that would be used. And I think no one else was positioned ready enough in the competition with that combination of properties. I’ve been able to take advantage of these papers and implement them as open source, and get them out to people. My guess, I don’t know. It wasn’t my plan.

Read more

Big Data

Geo localizing Medline citations

Where are the scientific publications coming from? Geolocalizing Medline citations

When and where are the scientific publications coming from? Which countries collaborate the most? To investigate those questions, we focused on Medline, the major repository of peer-reviewed biology and biomedical citations.

Big Data is not only a buzzword. A rich ecosystem of tools has emerged, together with new architectural paradigms, to tackle large problems. Open data are flowing around, waiting for new analysis angles. We have focused on the Medline challenge to demonstrate what can be achieved.

To provide some insight into how an interactive web application was built to explore such data, we will discuss the geographic localization method based on free-text affiliations, Hadoop-oriented processing with Scala and Spark, interactive analysis with the Zeppelin notebook and rendering with React, a modern JavaScript framework. The code has been open sourced on GitHub [1, 2] and the application is available on Amazon AWS.

Read more

Big Data

The evolution of bottlenecks in the Big Data ecosystem

I propose in this paper a chronological review of the events and ideas that have contributed to the emergence of the Big Data technologies of today and tomorrow. What we see is that bottlenecks move with the technical progress we make. Today it is the JVM garbage collector; tomorrow it will be a different problem.

Here is my side of the story:
Read more

Big Data

Big data: some myths

At my hairdresser’s, on the coffee table, I came across one of those trendy men’s magazines with a model on the cover and the promise of learning how to avoid 10 common mistakes when wearing a tie. I accidentally opened it at page 34: “The Big Data revolution.”


Read more