A quick summary and some thoughts on the Scikit-learn workshop

17/01/2017, by David Luz
Tags: Data & AI

On December 2nd, the workshop “Using Scikit-learn and Scientific Python at Scale” was held at Télécom ParisTech, with top contributors from the project as speakers. The workshop was divided into four talks:

  1. Scikit-learn for industrial applications, basic research and mind reading - Alexandre Gramfort
  2. Distributed computing for predictive modeling in Python - Olivier Grisel
  3. Scikit-learn at scale: out-of-core methods - Thierry Guillemot
  4. An Industrial application at Airbus Group - Vincent Feuillard

Scikit-learn is currently the most widely used open source library for machine learning applications. It is developed in Python (with Cython and C/C++) and, with over 1,000 pages of documentation, has become a major contribution to democratizing machine learning for a large audience.

A detailed presentation and outline of the talks can be found here.

Introduction (by A. Gramfort)

The program focused on the following subjects:

  • a background story about scikit-learn,
  • an explanation of why the project worked,
  • an overview of the methods with out-of-core support available in scikit-learn,
  • new libraries for scaling the development of scikit-learn applications,
  • a business application with an industrial use-case.

Scikit-learn for industrial applications, basic research and mind reading (by A. Gramfort)

This first talk gave a general introduction and presentation of the sklearn project. The speaker, who is one of the top committers and a major contributor, told us about the beginning of the project and highlighted some of the reasons that made it so successful.

Some facts:

  • Start of scikit-learn: official start in 2010 at Université Paris-Saclay (the project actually began in 2007 as a Google Summer of Code project)
  • 650+ contributors
  • 20K+ commits
  • Funding: INRIA, Paris-Saclay Center for Data Science, Télécom ParisTech, NYU, Google, Criteo
  • Installed on 1% of Debian systems
  • 1,200 job offers on Stack Overflow
  • Usage: 60% academia / 40% industry
  • The biggest Python library for machine learning
  • ...

Scikit-learn was designed to be domain agnostic (with the exception of text vectorization, which focuses on text analysis) and to make highly non-trivial tasks achievable in a few lines of code.
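As a purely illustrative sketch (our own example, not one shown in the talk), here is a cross-validated classifier trained on a built-in dataset in a handful of lines:

    # Minimal sketch: a cross-validated classifier in a few lines of scikit-learn.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))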

Some quotes:

"Machine Learning is easy, there is scikit-learn" - Gaël Varoquaux

"But making scikit-learn was not easy!" - Anonymous scikit-learn developer

The ingredients of success:

  • Technical reasons

    • nice web site with doc and examples
    • code tests, continuous integration
    • mailing list
    • rules on how to contribute
    • short release cycles
    • version control (use git)
  • Even more important reasons

    • improve upon existing project rather than creating something from scratch
    • clearly defined goal and scope
    • keep bounds on the technical difficulty
    • minimize dependencies
    • focus on not owning the project
    • good choice of license – scikit-learn uses BSD allowing commercial use
  • Social reasons

    • grow a community of contributors
    • Git, GitHub – review and give feedback
    • coding sprints with pair programming, code reviews
  • Researchers’ contributions

    • "alone you go fast, together you go far"
    • understanding that good software is crucial to advance research
    • a single API to learn a model: scikit-learn's simple API is often copied by others (most Python machine learning packages, Spark MLlib)
  • Scaling the development of the scikit-learn ecosystem

Examples of scikit-learn on some use cases:
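To illustrate the single fit/predict API mentioned above, here is a minimal sketch of our own (not an example from the talk): swapping one estimator for another is a one-line change because every model exposes the same interface.

    # Minimal sketch of the shared estimator API: every model exposes fit/predict/score.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(
        *load_iris(return_X_y=True), random_state=0)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)      # same call whatever the estimator
        print(type(model).__name__, model.score(X_test, y_test))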

Our take on this talk:

Scikit-learn's success certainly owes much to the machine learning expertise of the team that developed it, but first and foremost to the fact that its earliest contributors and founders were good coders, convinced that code quality and maintainability were crucial assets for the project. It is also a case study of a hugely successful open source project built on a small budget, with the mindset of serving a diverse community of users and democratizing machine learning.

Distributed computing for predictive modeling in Python (by O. Grisel)

The slides of this talk can be found here. The talk was initially given at PyData Berlin 2016.

NB: the information on slide 28 is outdated; merge and group-by operations can now be distributed.

This talk began with an introduction questioning whether distributed predictive modeling is really needed today. The speaker based his argument on an article ("Big RAM is eating big data" by S. Pafka) stating that, for the most part, dataset sizes grow by about 20% per year on average, while the RAM of the largest EC2 instances grows by about 50% per year. So why do distributed computing at a time when you can do almost anything in memory? This analysis was tempered by the fact that the study relied on KDnuggets surveys (conducted yearly since 2006), which may be biased, and that some datasets of several petabytes captured in the surveys do actually require distributed computation.

The talk then focused on approaches for running predictive models. There are basically two ways: the "fast lane", with distributed event stream processing for real-time applications, and the "slow lane", based on distributed storage and offline distributed batch processing. There are several alternatives for this, but the speaker focused mainly on the current Spark/Scala/Python paradigm and on Dask as an alternative. PySpark suffers from latency induced by the network architecture, and its tracebacks are hard to read because of the mix of Python and Scala code: there is no pure Python mode! The alternative is to use Dask and distributed.

In summary, the paradigm is to wrap functions in delayed mode (i.e. a promise that the function will be executed in the future), then pass the delayed objects to the cluster for scheduled computation. This approach has the advantage of a lower overhead than the Hadoop/MapReduce framework. With Dask, the delayed evaluations can be computed in parallel (multiple threads on a single machine, or multiple Python processes running on several machines) or on a single machine with a single thread (sequential code, easier to debug).
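To make the idea concrete, here is a minimal sketch of the dask.delayed pattern (our own toy example, not taken from the slides): plain Python functions are wrapped as delayed tasks, the calls only build a task graph, and compute() triggers the actual, possibly parallel, execution.

    # Toy sketch of the dask.delayed paradigm: build a graph of "promises", then compute it.
    from dask import delayed

    @delayed
    def load(i):
        return list(range(i * 10, (i + 1) * 10))

    @delayed
    def clean(chunk):
        return [x for x in chunk if x % 2 == 0]

    @delayed
    def summarize(chunks):
        return sum(sum(c) for c in chunks)

    # Nothing is executed here: these calls only build the task graph.
    chunks = [clean(load(i)) for i in range(4)]
    total = summarize(chunks)

    # compute() hands the graph to a scheduler (threads, processes, or a distributed cluster).
    print(total.compute())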

Our take on this talk:

Dask distributed seems to be a promising tool for distributing tasks on a cluster using Python, with some interesting advantages over the current PySpark approach.

Scikit-learn at scale: out-of-core methods (by T. Guillemot)

Out-of-core, by definition, refers to whatever does not fit in RAM. From Wikipedia:

“Out-of-core or external memory algorithms are algorithms that are designed to process data that is too large to fit into a computer’s main memory at one time. Such algorithms must be optimized to efficiently fetch and access data stored in slow bulk memory (auxiliary memory) such as hard drives or tape drives.”

What are the strategies for scaling scikit-learn computationally? The speaker presented examples of incremental learning:

  • too many samples? → Use mini-batches, but for some algorithms the final result is not exactly the same as with the classical algorithm
  • too many descriptors? → Use dimensionality reduction techniques

Scikit-learn provides several methods for solving out-of-core problems (basically, classes that expose a 'partial_fit' method). Rather than calling 'fit' once on the whole dataset, you call 'partial_fit' repeatedly on successive chunks of data.
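A minimal sketch of this pattern (our own example with synthetic data): the classifier is updated chunk by chunk, so the whole dataset never needs to be in memory at once.

    # Minimal sketch of incremental learning with partial_fit on synthetic mini-batches.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(random_state=0)
    classes = np.array([0, 1])          # all classes must be known from the first call

    rng = np.random.RandomState(0)
    for _ in range(20):                 # e.g. 20 chunks streamed from disk
        X_batch = rng.randn(1000, 50)
        y_batch = (X_batch[:, 0] > 0).astype(int)
        clf.partial_fit(X_batch, y_batch, classes=classes)

    X_test = rng.randn(1000, 50)
    y_test = (X_test[:, 0] > 0).astype(int)
    print("held-out accuracy:", clf.score(X_test, y_test))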

Some algorithms recently added to the sklearn library that support the partial_fit method:

  • Classification:

    • MultinomialNB
    • BernoulliNB
    • Perceptron
    • SGDClassifier
    • PassiveAggressiveClassifier
    • Neural nets (part of the latest release)
  • Clustering:

    • MiniBatchKMeans
  • Other:

    • IncrementalPCA

This link lists all the out-of-core methods available in scikit-learn.

Scikit-learn also provides some tools that are useful when dealing with these problems:

  • "Hashing Vectorizer" for text analysis
  • "RBFSampler" to apply a kernel approximation

For more information about out-of-core problems in scikit-learn, you can read this article.

For more information about feature extraction, you can read the related documentation.

A notebook about large scale text classification can be found on that page.

Our take on this talk:

Some new algorithms are now available that use incremental learning for training models with large datasets.

An Industrial application at Airbus Group (by V. Feuillard)

This talk was given by Vincent Feuillard, an R&D engineer in applied mathematics at Airbus Group Innovation. Unlike the other talks, this one was not especially focused on how to go to production with sklearn. It was more of a story about how the team set up a prototype for a specific use case: predictive (condition-based) maintenance using several signals from the airplane's Auxiliary Power Unit (APU).

Before the prototype, maintenance was based on engine health indicators defined by expert engineers. The idea behind the prototype was to approach the problem from a machine learning perspective and to have the results validated by experts from the AiRTHM team.

The Python stack they used for this project was:

  • for data munging: pandas
  • for dataviz: matplotlib, bokeh
  • for machine learning: scipy, sklearn

Compared with R, Python and scikit-learn proved better suited and easier to use for prototyping the pipeline from beginning to end, being better maintained, more stable and exposing a clearer API. The main lesson learned is that feature engineering is the most important step when doing anomaly detection on functional data.
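To give a flavour of that lesson, here is a purely hypothetical sketch (not the Airbus pipeline, and with synthetic data): raw signals are summarized into a few hand-crafted features with pandas, and an unsupervised detector from scikit-learn is then fitted on those features.

    # Hypothetical sketch: feature engineering on functional data, then anomaly detection.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    signals = pd.DataFrame(rng.randn(200, 300))   # synthetic stand-in: 200 curves of 300 samples

    # Hand-crafted summary features for each curve (the "feature engineering" step).
    features = pd.DataFrame({
        "mean": signals.mean(axis=1),
        "std": signals.std(axis=1),
        "max": signals.max(axis=1),
        "trend": signals.apply(lambda s: np.polyfit(np.arange(len(s)), s, 1)[0], axis=1),
    })

    detector = IsolationForest(contamination=0.05, random_state=0)
    labels = detector.fit_predict(features)       # -1 flags potential anomalies
    print("curves flagged as anomalous:", int((labels == -1).sum()))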

Our take on this talk:

The POC presented is a nice application of the Python/sklearn stack to an industrial business case. The speaker highlighted the multidisciplinary teamwork and the Agile organization of the project, with direct transfer of R&T development to Airbus operational support.

Conclusion

The workshop wrapped up with a comment from A. Gramfort, who explained that the project has grown to a size that is hard to maintain at the moment. Funding scikit-learn's development requires about 300-400 k€ per year, and so far this has been provided mainly by public funds, but the situation is not sustainable. Hence the founders are looking for alternative solutions. Since many companies, ranging from startups to established industrial players, are currently prototyping with scikit-learn, and some of them are willing to fund its development, there are prospects for creating an entity that could accept and manage donations, as the Wikimedia Foundation does for Wikipedia.

We warmly thank the speakers for their help and input with reviewing this article.