Data

POV: A streaming/communication platform for the data mesh

In 2021, a rich set of data is the soil that feeds the business of all the Internet giants (GAFAM, NATU, …). Meanwhile, traditional companies are striving to remain competitive: the necessary acceleration of their business requires a massive digitalization of their operations and assets, and data is among the most valuable of those digital assets. Big data’s promises are attractive. However, in practice the “data” organizational unit is commonly separated from the core business. Even though many of those departments put in a great deal of effort…

Read more

Data+AI Summit 2020 – be Zen in your lakehouse

In case you missed it, the first Data+AI Summit (formerly Spark+AI Summit) was held last week, and we had the chance to participate. The talks will be published online, but if you don’t want to wait, take a shortcut and read our key insights! TL;DR Wondering about that title? It summarizes the major announcements from a data engineering perspective: Apache Spark will become more Pythonic thanks to the Project Zen initiative and, very probably, will run on top of a lakehouse architecture.…
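
To give a concrete flavor of “more Pythonic”, here is a minimal sketch of our own (not code from the talks) using the Python-type-hint style of pandas UDFs that Spark 3.x promotes; the function and values are purely illustrative.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("zen-demo").getOrCreate()

# Spark 3.x infers the UDF style from plain Python type hints,
# instead of the older functionType flags.
@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9 / 5 + 32

df = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])
df.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()
```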

Read more

Time series features extraction using Fourier and Wavelet transforms on ECG data

ABSTRACT This article focuses on feature extraction from time series and signals using Fourier and Wavelet transforms. The task is carried out on an electrocardiogram (ECG) dataset in order to classify three groups of people: those with cardiac arrhythmia (ARR), those with congestive heart failure (CHF), and those with a normal sinus rhythm (NSR). Our approach consists of using scaleograms (i.e. 2D representations of 1D extracted features) as input to train a neural network (NN). We carried out the different tasks using Python as the programming language. The…
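
As an illustration of the scaleogram idea, here is a minimal sketch assuming PyWavelets (pywt) and a synthetic stand-in signal rather than the article’s ECG dataset:

```python
import numpy as np
import pywt

# Synthetic 1D signal standing in for one ECG record
# (sampling frequency and signal are illustrative).
fs = 360
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

# Continuous Wavelet Transform: rows = scales, columns = time steps.
scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

# |coefficients| as a 2D image: the scaleogram fed to the neural network.
scaleogram = np.abs(coeffs)
print(scaleogram.shape)  # (127, 720): one "image" per input signal
```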

Read more

Accelerating NiFi flows delivery: Part 1

While working with NiFi in different contexts, we have faced recurring challenges in the development, maintenance, and deployment of NiFi flows. Whereas the basic approach suggests manually duplicating pipelines for similar patterns, we believe an automated approach is relevant for production purposes when implementing a significant number of ingestion flows that rely on a limited set of patterns or, more simply, when deploying these flows to different execution environments. The ability to reach the right level of…
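
As an illustration of what such automation can look like, here is a minimal sketch that instantiates a flow template through NiFi’s REST API, assuming an unsecured local instance and a hypothetical template name; this is not the tooling the article builds, and exact payloads vary across NiFi versions.

```python
import requests

NIFI = "http://localhost:8080/nifi-api"  # assumed unsecured dev instance

# Resolve the id of the root process group.
root = requests.get(f"{NIFI}/flow/process-groups/root").json()
root_id = root["processGroupFlow"]["id"]

# Pick a registered template by name; the template itself would encode
# one of the reusable ingestion patterns mentioned above.
templates = requests.get(f"{NIFI}/flow/templates").json()["templates"]
template_id = next(t["template"]["id"] for t in templates
                   if t["template"]["name"] == "ingestion-pattern")  # hypothetical

# Instantiate the template once per flow instead of duplicating it by hand.
payload = {"templateId": template_id, "originX": 0.0, "originY": 0.0}
resp = requests.post(f"{NIFI}/process-groups/{root_id}/template-instance",
                     json=payload)
resp.raise_for_status()
```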

Read more

Industrial document classification with Deep Learning

Knowledge is a goldmine for companies. It comes in different shapes and forms, mainly documents (presentation slides and documentation) that allow businesses to share information with their customers and staff. The way companies harness this knowledge is central to their ability to develop their business successfully. One of the common ways to ease access to this document base is to use search engines based on textual data. At OCTO, we have decided to use optical character recognition (OCR) solutions to extract this data, since…
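
As a sketch of what the OCR extraction step could look like, assuming pytesseract and pdf2image rather than the specific solution evaluated in the article:

```python
from pdf2image import convert_from_path
import pytesseract

# Hypothetical input file; the article works on real company documents.
pages = convert_from_path("slides.pdf", dpi=300)

# OCR each page image to text; this text then feeds the classifier
# (and, eventually, a search engine index).
text = "\n".join(pytesseract.image_to_string(p, lang="eng") for p in pages)
print(text[:500])
```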

Read more

Confluent.io – Part 3: STREAM PROCESSING

This article is part of a series designed to demonstrate the setup and use of the Confluent Platform. In this series, our goal is to build an end-to-end data processing pipeline with Confluent. Disclaimer: while knowledge of Kafka internals is not required to understand this series, it can sometimes help clarify some parts of the articles. In the previous articles, we set up two topics: one to publish the input data coming from PostgreSQL and another to push the data from…
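
The article covers Confluent’s stream processing tooling; as a language-agnostic illustration of the underlying consume-transform-produce loop, here is a minimal Python sketch with the confluent-kafka client, using hypothetical topic names:

```python
from confluent_kafka import Consumer, Producer

conf = {"bootstrap.servers": "localhost:9092"}  # assumed local broker
consumer = Consumer({**conf, "group.id": "demo-stream",
                     "auto.offset.reset": "earliest"})
producer = Producer(conf)

consumer.subscribe(["input-topic"])  # hypothetical topic names
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Stream processing in its simplest form: transform and re-publish.
    value = msg.value().decode("utf-8").upper()
    producer.produce("output-topic", value.encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```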

Read more

Confluent.io – Part 2: BUILD A STREAMING PIPELINE

This article is part of a series designed to demonstrate the setup and use of the Confluent Platform. In this series, our goal is to build an end-to-end data processing pipeline with Confluent. Disclaimer: while knowledge of Kafka internals is not required to understand this series, it can sometimes help clarify some parts of the articles. BASICS If you have gone through every step of our previous article, you should have a Kafka broker running along with ZooKeeper and Control Center. Now,…
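
As a minimal sketch of the kind of building blocks involved, assuming the local broker from the previous article and illustrative topic names (the article itself drives this through the Confluent tooling):

```python
from confluent_kafka.admin import AdminClient, NewTopic
from confluent_kafka import Producer

conf = {"bootstrap.servers": "localhost:9092"}  # assumed local broker

# Create the topic the pipeline will write to (name is illustrative).
admin = AdminClient(conf)
futures = admin.create_topics([NewTopic("input-topic", num_partitions=3,
                                        replication_factor=1)])
futures["input-topic"].result()  # block until the topic actually exists

# Publish a test record to check the pipeline end to end.
producer = Producer(conf)
producer.produce("input-topic", key=b"42", value=b'{"hello": "confluent"}')
producer.flush()
```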

Read more

Confluent.io – Part 1: INTRODUCTION & SETUP

This article is part of a series designed to demonstrate the setup and use of the Confluent Platform. In this series, our goal is to build an end-to-end data processing pipeline with Confluent. Disclaimer: while knowledge of Kafka internals is not required to understand this series, it can sometimes help clarify some parts of the articles. INTRODUCTION Let’s begin with these two questions: what is the Confluent Platform, and why use it? What? The Confluent Platform is a data streaming platform built…
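
Once the platform is up, a quick smoke test can confirm the broker is reachable; here is a minimal sketch with the confluent-kafka Python client, assuming the default local port:

```python
from confluent_kafka.admin import AdminClient

# Assumed local dev setup: broker reachable on the default port.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# list_topics() queries cluster metadata: a quick check that the broker
# started during the setup steps is actually answering.
metadata = admin.list_topics(timeout=10)
print("broker(s):", list(metadata.brokers))
print("topics:", list(metadata.topics))
```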

Read more

Visualizing massive data streams: a public transport use case

Public transport companies release more data every day, and some of them are even opening their information systems up to real-time streaming (Swiss transport, TPG in Geneva, and RATP in Paris are a few local examples). Vast lands are unveiled for technical experimentation! Besides real-time data, these companies also publish their full schedules. In Switzerland, the schedules describe trains, buses, tramways, boats, and even gondolas. In this post, we propose to walk through an application built to visualize, in fast motion, one day of activity,…
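
As a back-of-the-envelope sketch of what “one day of activity” looks like in published schedule data, assuming a GTFS export (the format used by the Swiss open transport data portal; the article’s actual data source may differ):

```python
import pandas as pd

# stop_times.txt is a standard GTFS file listing every scheduled
# stop event of every trip over the service day.
stop_times = pd.read_csv("gtfs/stop_times.txt",
                         usecols=["trip_id", "arrival_time", "stop_id"])

# Count stop events per minute over one day: the raw material
# for a fast-motion activity visualization.
minutes = stop_times["arrival_time"].str.slice(0, 5)  # "HH:MM"
activity = minutes.value_counts().sort_index()
print(activity.head())
```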

Read more