Data

A quick summary and some thoughts on the Scikit-learn workshop

On December 2nd, the workshop “Using Scikit-learn and Scientific Python at Scale” was held at Telecom ParisTech, with top contributors to the project as speakers. The workshop was divided into four talks: "Scikit-learn for industrial applications, basic research and mind reading" (Alexandre Gramfort); "Distributed computing for predictive modeling in Python" (Olivier Grisel); "Scikit-learn at scale: out-of-core methods" (Thierry Guillemot); and "An Industrial application at Airbus Group" (Vincent Feuillard). Scikit-learn is currently the most widely used open source library…

Read more

D3.js transitions killed my CPU! A d3.js & pixi.js comparison

D3.js is certainly the most versatile JavaScript data rendering library available: turning data into mind-blowing visualizations is limited only by your imagination. A key component for turning static pages into animated ones is its powerful selection transitions. However, too many simultaneous transitions on a web page will soon bring your CPU to its knees. Hence this blog post. We faced this problem when displaying Swiss transport real-time data on a map, within an SVG layout: rendering was lagging, event sourced data were not…

Read more

A Journey into Industrializing the Writing and Deployment of Kibana Plugins (riding Docker)

by Alexandre Masselot (OCTO Technology Switzerland), Catherine Zwahlen (OCTO Technology Switzerland) and Jonathan Gianfreda. Custom plugins are one of Kibana's strong promises. We propose an end-to-end tutorial on writing such plugins. But this "end-to-end" approach also raises the questions "how to continuously deploy them?" and "how to share an environment with seeded data?" Those questions lead us to a full-fledged integration infrastructure, backed by Docker. Elasticsearch has grown from a Lucene evolution to a full-fledged distributed document store, with powerful storage,…

Read more

A chat with Doug Cutting about Hadoop

We had the chance to interview Doug Cutting during the Cloudera Sessions in Paris in October 2014. Doug is the creator of Hadoop and Cloudera's Chief Architect. Here is our exchange. How does it feel to see Hadoop becoming the must-have, the default way of storing and computing over data in large enterprises? Rationally it feels very good. It’s a technology that’s supposed to do that. Emotionally it’s very satisfying, but also I must say I must…

Read more

Geo localizing Medline citations

When and where are the scientific publications coming from? Which countries collaborate the most? To investigate those questions, we focused on Medline, the major repository of peer-reviewed biology and biomedical citations. Big Data is not only a buzzword. A rich ecosystem of tools has emerged, together with new architectural paradigms, to tackle large problems. Open data are flowing around, waiting for new analysis angles. We have focused on the Medline challenge to demonstrate…

Read more

Gather shopping receipts: architecture overview

Following our first post (in French) on the business challenges raised by data collection and analysis in the retail sector, we now present a use case and its associated issues. We will see how to address them with modern technologies that have already proven themselves at the Web giants: Kafka, Spark and Cassandra.

Read more

The evolution of bottlenecks in the Big Data ecosystem

In this paper, I propose a chronological review of the events and ideas that have contributed to the emergence of the Big Data technologies of today and tomorrow. What we can see is that bottlenecks move as technical progress is made. Today it is the JVM garbage collector; tomorrow it will be a different problem. Here is my side of the story:

Read more

Big data: some myths

At my hairdresser’s, on the coffee table, I came across one of those trendy men's magazines with a model on the cover and the promise to teach you how to avoid 10 common mistakes when wearing a tie. I accidentally opened it to page 34: "The Big Data revolution."

Read more

Hadoop in my IT department: benchmark your cluster

The stress test is a very important step when you go live. Good stress tests help us to: ensure that the software meets its performance requirements; ensure that the service will deliver a fast response time even under a heavy load; and get to know the scalability limits, which in turn is useful for planning the next steps of the development. Hadoop is not a web application, a database or a web service. You don't stress test a Hadoop job with a heavy load. Instead, you need…

Read more