bigdata

Data

Data+AI Summit 2020 – be Zen in your lakehouse

In case you missed it, the first Data+AI Summit (formerly Spark+AI Summit) was held last week, and we had the chance to participate. The talks will be published online, but if you don't want to wait, take a shortcut and read our key insights! TL;DR Wondering about the title? It summarizes the major announcements from a data engineering perspective: Apache Spark will become more Pythonic thanks to the Project Zen initiative, and it will very probably run on top of a lakehouse architecture.…

Read more
Data

Time series features extraction using Fourier and Wavelet transforms on ECG data

ABSTRACT This article focuses on feature extraction from time series and signals using Fourier and Wavelet transforms. The task is carried out on an electrocardiogram (ECG) dataset in order to classify three groups of people: those with cardiac arrhythmia (ARR), congestive heart failure (CHF) and normal sinus rhythm (NSR). Our approach consists of using scaleograms (i.e. 2D representations of 1D extracted features) as input to train a Neural Network (NN). We carried out the different tasks using Python as the programming language. The…
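
To give a flavour of the approach described in the teaser, here is a minimal sketch of how a scaleogram can be computed from a 1D signal with the PyWavelets library; the synthetic signal, the Morlet wavelet and the scale range are our own assumptions for illustration, not necessarily the article's exact setup.

```python
import numpy as np
import pywt  # PyWavelets
import matplotlib.pyplot as plt

# Synthetic 1D signal standing in for one ECG recording
# (assumption: the article works on real ECG traces instead).
fs = 360                      # sampling frequency in Hz
t = np.arange(0, 2, 1 / fs)   # 2 seconds of signal
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 25 * t)

# Continuous Wavelet Transform: one row of coefficients per scale.
scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

# The scaleogram is the 2D magnitude map (scales x time) that can be
# fed to a neural network like an image.
scaleogram = np.abs(coeffs)
plt.imshow(scaleogram, aspect="auto", cmap="viridis")
plt.xlabel("time (samples)")
plt.ylabel("scale")
plt.show()
```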

Read more
Data

Accelerating NiFi flows delivery: Part 1

While working in different contexts with NiFi, we have faced recurring challenges in the development, maintenance and deployment of NiFi flows. Whereas the basic approach suggests manually duplicating pipelines for similar patterns, we believe an automated approach is relevant for production purposes when it comes to implementing a significant number of ingestion flows relying on a limited set of patterns or, more simply, when it comes to deploying these flows on different execution environments. The ability to reach the right level of…
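
As a hedged illustration of what automating flow delivery can look like, here is a minimal sketch that instantiates an existing NiFi template through the NiFi REST API; the host URL, template name and target process group are placeholder assumptions, and the tooling the article actually presents may differ.

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumed NiFi instance
TEMPLATE_NAME = "ingestion_pattern_v1"        # hypothetical template name
TARGET_GROUP = "root"                         # deploy into the root process group

# Look up the template id by name among the templates known to the instance.
templates = requests.get(f"{NIFI_API}/flow/templates").json()["templates"]
template_id = next(
    t["template"]["id"]
    for t in templates
    if t["template"]["name"] == TEMPLATE_NAME
)

# Instantiate the template inside the target process group.
resp = requests.post(
    f"{NIFI_API}/process-groups/{TARGET_GROUP}/template-instance",
    json={"templateId": template_id, "originX": 0.0, "originY": 0.0},
)
resp.raise_for_status()
print("Flow deployed from template", TEMPLATE_NAME)
```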

Read more
Archi & Techno

Industrial document classification with Deep Learning

Knowledge is a goldmine for companies. It comes in different shapes and forms: mainly documents (presentation slides and documentation) that allow businesses to share information with their customers and staff. The way companies harness this knowledge is central to their ability to develop their business successfully. One of the common ways to ease access to this document base is to use search engines based on textual data. At OCTO, we have decided to use optical character recognition (OCR) solutions to extract this data, since…
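
As a minimal sketch of the kind of pipeline this suggests (OCR to extract text, then a classifier on top), here is what a baseline could look like with pytesseract and scikit-learn; the file names and labels are hypothetical, and the article's actual Deep Learning models are more sophisticated than this linear baseline.

```python
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: OCR - extract raw text from scanned document images
# (hypothetical file names; any image readable by Pillow works).
paths = ["slide_001.png", "contract_001.png", "slide_002.png"]
texts = [pytesseract.image_to_string(Image.open(p)) for p in paths]
labels = ["presentation", "contract", "presentation"]  # hypothetical classes

# Step 2: a simple text classifier as a baseline; the article itself
# relies on Deep Learning rather than this linear model.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict([pytesseract.image_to_string(Image.open("new_doc.png"))]))
```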

Read more
Event

Afterwork in Geneva on Thursday, November 10: “Data Science & Machine Learning: explore, understand and predict”

For our third Afterwork on the theme of “Big Data”, we are offering an introduction to the practices and benefits of Data Science. While the previous sessions showed how to store and process large volumes of data at a lower cost, we will tackle a new angle: how to uncover the treasure trove of information hidden in your data.

Read more
Archi & Techno

Hadoop in my IT department: benchmark your cluster

The stress test is a very important step before going live. Good stress tests help us to: ensure that the software meets its performance requirements; ensure that the service will deliver a fast response time even under a heavy load; get to know the scalability limits, which in turn is useful when planning the next steps of development. Hadoop is not a web application, a database or a web service, so you don't stress test a Hadoop job with a heavy load. Instead, you need…
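
One common way to benchmark a cluster, in the spirit of the excerpt above, is to time a reference job such as TeraSort at increasing data volumes and watch how throughput evolves; below is a minimal Python sketch of that loop, where the jar path and the volumes are assumptions for illustration.

```python
import subprocess
import time

# Assumed path to the standard Hadoop examples jar shipping TeraGen/TeraSort.
EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"

def run(cmd):
    """Run a command and return its elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

# Benchmark at increasing volumes to find where throughput stops scaling.
for rows in (10_000_000, 100_000_000, 1_000_000_000):  # TeraGen rows are 100 bytes
    gen_dir, sort_dir = f"/bench/in_{rows}", f"/bench/out_{rows}"
    run(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(rows), gen_dir])
    elapsed = run(["hadoop", "jar", EXAMPLES_JAR, "terasort", gen_dir, sort_dir])
    mb = rows * 100 / 1e6
    print(f"{mb:,.0f} MB sorted in {elapsed:,.0f} s -> {mb / elapsed:,.1f} MB/s")
```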

Read more
Archi & Techno

Hadoop in my IT department: How to plan a cluster?

OK, you have decided to set up a Hadoop cluster for your business. The next step is planning the cluster… But Hadoop is a complex stack and you might have many questions: HDFS deals with replication and MapReduce creates files… How can I plan my storage needs? How can I plan my CPU needs? How can I plan my memory needs? Should I consider different needs on some nodes of the cluster? I heard that MapReduce moves its job code to where the data to process is located……
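
As a back-of-the-envelope illustration of the storage question raised above, here is a small sketch; the replication factor of 3 is the HDFS default, while the daily volume, retention, temporary-space and headroom figures are assumptions to adapt to your own context.

```python
# Rough HDFS capacity estimate (all figures except replication are assumptions).
daily_ingest_tb = 0.5    # raw data landed per day
retention_days = 365     # how long data is kept
replication = 3          # HDFS default replication factor
temp_overhead = 0.25     # extra space for MapReduce intermediate files
headroom = 0.75          # keep disks at most 75% full

logical_tb = daily_ingest_tb * retention_days
raw_tb = logical_tb * replication * (1 + temp_overhead) / headroom
print(f"Logical data: {logical_tb:.0f} TB, raw disk to provision: {raw_tb:.0f} TB")
```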

Read more
Archi & Techno

The economic crisis: an opportunity not to be missed!

The financial news and the economic outlook are putting IT department budgets under pressure. Budget cuts have the merit of highlighting the topics perceived as the most important and urgent. Choosing between keeping the business activity afloat and transforming it is a difficult decision for every management team: “run the business or change the business”? IT investments are subject to the same dilemma. Nevertheless, the companies that…

Read more
Consulting Chronicles

Introduction to Datastax Brisk: a Hadoop and Cassandra distribution

As the Apache Hadoop ecosystem grows while its core matures, several companies now provide business-class Hadoop distributions and services. While EMC, after acquiring Greenplum, seems to be the biggest player, other companies such as Cloudera or MapR are also competing. This article introduces Datastax Brisk, an innovative Hadoop distribution that leverages the Apache Hive data warehouse infrastructure on top of an HDFS-compatible storage layer based on Cassandra. Brisk tries to reconcile real-time applications with low-latency requirements (OLTP) and big data analytics (OLAP) in one system.…

Read more