In Part 2, we practiced how to induce and detect drifts using this GitHub package. We dived deeper into the drift detector families and their respective performance on various types of drifts. The link to the article is here.
Now, one question remains: how do you choose the drift detector best suited to your situation? Answering it is the aim of this article.
The below table is a brief summary of the detector families that we will compare throughout this article.
Family Name | Description | Examples | Tools used |
---|---|---|---|
Univariate Drift Detectors | Detect drifts by going through each feature and comparing the new batch distribution with the training distribution | 2-sample Kolmogorov-Smirnov Test, Wasserstein Distance, Population Stability Index (PSI) | Evidently AI, DeepChecks, TensorFlow Data Validation |
Multivariate Drift Detectors | Detect drifts by looking at the covariates as a whole (not feature by feature) and comparing them with the training distribution | Margin Density Drift Detection Method (MD3), OLINDDA, Hellinger Distance Based Drift Detection Method (HDDDM) | GitHub and literature articles |
Drift Detectors with Labels | Detect drifts using the labels, by monitoring the model performance (the error rate) | Early Drift Detection Method (EDDM), Hoeffding Drift Detection Method (HDDM_W), ADaptive WINdowing (ADWIN) | Python package River |
Depending on your use case, you will have to choose between a very accurate and sensitive detector that raises many potential false alarms, and a less accurate detector that spares you constant disturbances. You may also only be interested in drops in the model's performance.
The article is split into four parts. The first part gives several scenarios and recommends the most appropriate drift detector family to use for each scenario. The next 3 parts give extended information about each drift detector family: part 2 gives guidelines about univariate drift detectors, part 3 about drift detectors with labels and part 4 about multivariate drift detectors.
If you’re only interested in one specific drift detector family, feel free to jump directly to the corresponding section.
Because you cannot afford any mistakes, you will want to have full visibility of the dataset and its potential drifts over time. And because any type of drift could impact your model, you will be interested in both real and virtual concept drifts. Yet, you are aware that this level of monitoring will come at the expense of false alarms.
Implementing univariate drift detectors will therefore ensure that you keep an overview of all features of your model.
There are many platforms available: cloud-based, private, or open-source. A particularly useful one is DeepChecks [20]: its Feature Drift check efficiently detects virtual concept drifts. It also includes another useful check, Whole Dataset Drift, which enables you to detect some real concept drifts as well.
Whole Dataset Drift works the following way: we add a label column that is set to 0 for samples from the training dataset and 1 for samples from the new batch. We then merge the two datasets and shuffle them. The goal is to train a classifier to predict whether a data point comes from the old dataset (0) or the new one (1). If this classifier can distinguish between the old and new datasets, it is likely that a drift occurred; if it cannot, there is no need to worry about drift. The below plot, and the sketch that follows, illustrate how Whole Dataset Drift works.
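Here is a minimal sketch of this "domain classifier" idea using scikit-learn. It is not DeepChecks' actual implementation (which also turns the result into a calibrated drift score), and the data and column names below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical reference (training) data and a new batch with the same columns
rng = np.random.default_rng(0)
reference = pd.DataFrame({"f1": rng.normal(0, 1, 2_000), "f2": rng.normal(5, 2, 2_000)})
new_batch = pd.DataFrame({"f1": rng.normal(0.5, 1, 2_000), "f2": rng.normal(5, 2, 2_000)})

# Label the origin of each row: 0 = reference sample, 1 = new batch sample
X = pd.concat([reference, new_batch], ignore_index=True)
y = np.array([0] * len(reference) + [1] * len(new_batch))

# Train a "domain classifier" to tell the two datasets apart
auc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="roc_auc",
).mean()
print(f"Domain classifier AUC = {auc:.2f}")
# AUC close to 0.5   -> the batches are indistinguishable, no drift signal
# AUC well above 0.5 -> the batches can be separated, a drift likely occurred
```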
For more information, you can visit Deepchecks’ documentation. It also gives information about the drift score and how to interpret it.
Evidently AI [8] can also be a powerful tool. Although the default metrics can be very sensitive (many false alarms), the platform gives you the flexibility to choose your own.
You are clearly interested in drifts that impact the model’s performance. This is exactly what the family of drift detectors with labels does: it compares the predicted and actual values and monitors the resulting error rate, as the sketch below illustrates. Its main drawback, however, is that you need timely access to those labels. If you are able to get them, you have found your ideal detector family: the drift detectors with labels.
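Here is a minimal sketch of this pattern with the River package. The stream is made up, and API details vary slightly between River versions; this assumes a recent release where detectors expose a `drift_detected` attribute:

```python
from river import drift

# Hypothetical stream of (predicted, true) label pairs arriving over time
stream = [(1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]  # ...

detector = drift.ADWIN()  # any label-based detector from River could be plugged in here
for i, (y_pred, y_true) in enumerate(stream):
    error = int(y_pred != y_true)   # 1 if the model was wrong, 0 otherwise
    detector.update(error)          # we feed the error signal, not the raw features
    if detector.drift_detected:
        print(f"Drift in the error rate detected at sample {i}")
```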
However, if you are not able to access the true values, below are some alternative options:
One last comment about MD3: it can easily be adapted to Machine Learning models that don’t expose an explicit margin (such as rule-based models). For ensemble models, we can use the “disagreement” between the individual models to flag a sample as “uncertain”.
You want a detection method that effectively detects large drifts, but you don’t want frequent false alarms. We will first clarify what is meant by large drifts and then suggest detection methods.
By large drift, you may mean a drift that impacts the model’s performance. In that case, go to Scenario 2.
Large drift may also mean a geographical shift of the data in feature space. Although such a shift may be associated with a change in the model’s performance, this is not always the case.
In the below graph, we use two variables to classify points as Class A or B. We can see that the new points are geographically different from the old points, but the model’s performance has not changed: they are still classified correctly.
In that case, drift detection methods focusing on the model’s performance would not detect this drift, whereas detectors looking at the geographical changes would.
You decided to go with the univariate drift detector family. Now a question remains: Wasserstein Distance, Population Stability Index, Jensen-Shannon Divergence… which one to choose?
We have put together some key guidelines in order to help you find the univariate detector best suited to your situation.
Some detectors only work for specific data types. We can therefore distinguish between detectors suited for categorical features and detectors suited for numerical features; others work for both data types.
Suited for Categorical [9] | Suited for Numerical [9] |
---|---|
Kullback-Leibler divergence, Jensen-Shannon Divergence, Population Stability Index, Chi-square (goodness of fit), Cramer’s V | Kullback-Leibler divergence, Jensen-Shannon Divergence, Population Stability Index, Wasserstein Distance, Kolmogorov-Smirnov Test |
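As an illustration of the table above, here is a small sketch using scipy: a 2-sample Kolmogorov-Smirnov test for a numerical feature and a chi-square goodness-of-fit test for a categorical one. The data and category counts are synthetic, and the usual p-value convention (< 0.05) is used:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Numerical feature: 2-sample Kolmogorov-Smirnov test
ref_num = rng.normal(loc=0.0, scale=1.0, size=1_000)
new_num = rng.normal(loc=0.3, scale=1.0, size=1_000)   # slightly shifted new batch
ks_stat, ks_pvalue = stats.ks_2samp(ref_num, new_num)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_pvalue:.4f}")   # p < 0.05 -> drift

# Categorical feature: chi-square goodness-of-fit test
ref_counts = np.array([500, 300, 200])   # category counts in the reference data
new_counts = np.array([430, 350, 220])   # category counts in the new batch
expected = ref_counts / ref_counts.sum() * new_counts.sum()   # expected counts under "no drift"
chi_stat, chi_pvalue = stats.chisquare(f_obs=new_counts, f_exp=expected)
print(f"Chi-square statistic = {chi_stat:.1f}, p-value = {chi_pvalue:.4f}")
```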
Some drift detectors will be sensitive to the sample size. Some will not work well on small datasets, others will struggle with large datasets.
Roughly speaking, we can distinguish between statistical tests and scores. Statistical tests are more efficient with small datasets (fewer than roughly 10,000 samples), whereas score-based detectors are preferred for large datasets. [22]
There is an underlying statistical explanation: with large samples, even the smallest change can become statistically significant. It is therefore preferable to use scores instead of statistical tests on datasets with large sample sizes. [10]
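The sketch below illustrates this effect with a 2-sample Kolmogorov-Smirnov test on synthetic data: the very same tiny shift in the mean goes from "not significant" to "extremely significant" simply because the sample size grows:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
shift = 0.05   # a practically negligible shift in the mean

for n in (500, 5_000, 500_000):
    ref = rng.normal(0.0, 1.0, size=n)
    new = rng.normal(shift, 1.0, size=n)
    stat, pvalue = ks_2samp(ref, new)
    print(f"n = {n:>7}: KS statistic = {stat:.4f}, p-value = {pvalue:.4f}")
# With n = 500 the shift is usually not significant; with n = 500,000
# the p-value collapses towards 0 even though the shift is practically negligible.
```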
This is also illustrated in an article by the team behind the drift detection tool Evidently AI. [22] They compare five different metrics and conclude that score-based metrics are more suitable for datasets with large sample sizes.
Why don’t we just use score-based metrics then? Statistical test detectors have several advantages which we will see in the next two paragraphs.
Suited for small sample sizes | Suited for large sample sizes |
---|---|
Kolmogorov-Smirnov Test, Chi-square (goodness of fit), Cramer’s V | Kullback-Leibler divergence, Jensen-Shannon Divergence, Population Stability Index (PSI), Wasserstein Distance |
Choosing the right detector will also depend on the size of the drift you want to detect.
Evidently AI ran an experiment in which the dataset was shifted by 1%, 5%, 7%, 10%, and 20% respectively, and compared the results of 5 different univariate detectors. [22]
Population Stability Index and Kullback-Leibler Divergence only detected large changes: they have low sensitivity.
In contrast, the Kolmogorov-Smirnov Test detected even the smallest changes: it has high sensitivity.
Jensen-Shannon Divergence and Wasserstein distance were in between: they showed “medium” sensitivity.
It is also important to be aware that those results use the default parameters of the metric. For instance, a drift is detected if the PSI value is higher than 0.2. However, it is always possible to tune those parameters using historical data or by comparing different parameter values.
Lastly, each detector will have a different interpretation.
Statistical tests are usually easy to interpret: the null hypothesis is that both datasets come from the same distribution, and based on the p-value, you either reject it or fail to reject it.
For scores, it is trickier because they usually return a value from 0 to infinity: 0 means the distributions are identical, and the higher the score, the more the two distributions differ. Some score-based detectors, such as the Population Stability Index, come with conventional threshold values that ease the interpretation [22]: a PSI below 0.1 is usually read as no significant change, a value between 0.1 and 0.2 as a moderate change, and a value above 0.2 as a significant change.
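For reference, here is a minimal PSI implementation for a single numerical feature. It is a common, simplified formulation: the quantile binning and the 0.1 / 0.2 thresholds are conventions rather than a standard:

```python
import numpy as np

def psi(reference, new, n_bins=10, eps=1e-6):
    """Population Stability Index for one numerical feature (simplified formulation)."""
    # Quantile bin edges are computed on the reference distribution
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    new_frac = np.histogram(new, bins=edges)[0] / len(new)
    # Clip to avoid division by zero / log(0) on empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    new_frac = np.clip(new_frac, eps, None)
    return np.sum((new_frac - ref_frac) * np.log(new_frac / ref_frac))

rng = np.random.default_rng(0)
value = psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
print(f"PSI = {value:.2f}")   # conventionally: < 0.1 no change, 0.1-0.2 moderate, > 0.2 significant
```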
Other detectors are more difficult to interpret. The Wasserstein Distance measures the amount of work it takes to turn one distribution into the other, and unless it is normalized, the distances cannot be compared across features. For instance, if one feature is in kilometers and another is in degrees, you will need to interpret their distances separately.
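One simple way to make Wasserstein distances comparable across features is to normalize them, for instance by the standard deviation of the reference feature; treat this as one possible convention rather than the only one. A quick sketch with scipy and synthetic data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
ref = rng.normal(loc=100.0, scale=15.0, size=5_000)   # e.g. a feature measured in kilometers
new = rng.normal(loc=110.0, scale=15.0, size=5_000)

raw = wasserstein_distance(ref, new)
normed = raw / ref.std()   # express the distance in "standard deviations of the reference"
print(f"raw = {raw:.2f}, normed = {normed:.2f}")
```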
You have access to labels and you decide to go with the drift detectors with labels. Which one should you choose?
First of all, there is a wide variety of drift detectors with labels. Unless you have time to dive deep into the literature and implement some detectors manually, we would recommend using one of the eight detectors provided in the Python package River [20].
Several literature articles have done large-scale comparisons of concept drift detectors. The 2018 article from Barros called “A Large-scale Comparison of Concept Drift Detectors” [1] is particularly useful as it compares 14 detectors across several datasets and types of drifts.
The below graph depicts the aggregated results for those 14 detectors and shows that HDDM [16] was the most performant in terms of the Matthews Correlation Coefficient (MCC) metric.
The MCC criterion is based on the four values of the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
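For reference, the MCC combines those four values into a single score between -1 and 1:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))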
Moreover, you can see in yellow the algorithms implemented in the package River: both HDDM-A and HDDM-W are implemented.
Graph extracted from “A Large-scale Comparison of Concept Drift Detectors” from Barros (2018).
Therefore, our winner-takes-all solution would be to use HDDM (HDDM_A or HDDM_W; both variants work well), which offers high performance in terms of Precision, Recall, and Accuracy, which is what the MCC criterion captures [16].
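In River, switching to this detector is a small change. Note that the import path has moved between versions (older releases expose `river.drift.HDDM_W`, newer ones `river.drift.binary.HDDM_W`), and the error stream below is made up:

```python
from river.drift.binary import HDDM_W   # older River versions: from river.drift import HDDM_W

# Toy error stream: the model is right most of the time, then starts failing
error_stream = [0] * 500 + [1] * 100    # hypothetical 0/1 misclassification flags

hddm = HDDM_W()   # weighted moving-average variant; HDDM_A is the alternative
for i, error in enumerate(error_stream):
    hddm.update(error)
    if hddm.drift_detected:
        print(f"Drift in the error rate detected at sample {i}")
        break
```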
If you prefer a more customized approach, you can follow these additional guidelines:
Multivariate Drift Detectors are a very recent subject in research and are little known in the industry. However, they can be very powerful methods.
In research, multivariate drift detectors can be classified in different ways: by data type or by model type. The two classifications are independent of each other.
Below are some guidelines to help you find what works best for your use case.
A 2020 article by Gemaque et al., “An overview of unsupervised drift detection methods” [5], introduces a taxonomy to classify multivariate drift detectors. The authors first distinguish between batch and stream data. On the batch side, they further classify methods by how data is manipulated in the reference window; on the stream side, the criterion is how data is manipulated in the detection window (whether its size is fixed or dynamic).
Taxonomy extracted from “An overview of unsupervised drift detection methods” by Gemaque et al. (2020).
Therefore, in order to choose your optimal multivariate drift detector, it is recommended to check your data type and how you will process it. The below table is an extract of the article and gives one detector for each section:
Category | Subcategory | Method | References |
---|---|---|---|
Batch | Whole-batch detection | MD3 | Sethi and Kantardzic (2015) |
Batch | Partial-batch detection | DDAL | Costa, Albuquerque, and Santos (2018) |
Online | Fixed reference window | CD-TDS | Koh (2016) |
Online | Sliding reference window | SAND | Haque, Khan, and Baron (2016) |
Multivariate Drift Detectors are still at a very early stage in the industry. Therefore, you will probably need to invest some time to research and implement detectors.
If you don’t have time, we would recommend using existing implemented algorithms on GitHub. The Menelaus repository [15] has implemented several algorithms that can be easily reused.
In point 1, we talked about how one multivariate methodology may be more suited based on the data type. Now, we will cover the different categories of multivariate detectors so that you can select one based on your Machine Learning model. For instance, if you used a clustering model in your project such as K-means, you may want to also use a cluster-based detector.
We are relying on the taxonomy used in the 2017 article “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data” by Sethi and Kantardzic [18]. Multivariate drift detectors can be split into several categories. Here we will quickly mention three of them: novelty detection, multivariate distribution, and model-dependent. In our research, we have implemented one detector in each category.
Novelty detection methods rely on “distance and/or density information to detect previously unseen data distribution patterns” (Sethi and Kantardzic, 2017) [18].
One of them is OLINDDA (implemented in the GitHub package) which uses K-means clustering in order to detect new patterns in the data. Those patterns can be translated into concept drifts or novelty points, as the below plot shows:
Here, the S1 cluster illustrates a concept drift whereas the S2 cluster depicts a novelty.
Graph extracted from the article “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data” (Sethi and Kantardzic, 2017) [18] to illustrate clustering methods.
Those methods can be very powerful if the drift is manifested as a new cluster or a new region of space. However, they can suffer from the curse of dimensionality and are restricted to “cluster-able drifts” only.
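To give a flavour of how cluster-based detection works, here is a simplified sketch that flags new points falling outside every known cluster. It is not OLINDDA itself (which additionally distinguishes drift from novelty and updates its clusters over time), and the data and thresholds are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(2_000, 2))          # reference data
X_new = np.vstack([
    rng.normal(0.0, 1.0, size=(900, 2)),                  # same concept as before
    rng.normal(6.0, 0.5, size=(100, 2)),                  # an emerging, unseen cluster
])

# Fit clusters on the reference data and record a distance threshold
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
train_dist = np.min(kmeans.transform(X_train), axis=1)    # distance to the closest centroid
radius = np.quantile(train_dist, 0.99)                    # 99th percentile of those distances

# Flag new points that fall outside every known cluster
new_dist = np.min(kmeans.transform(X_new), axis=1)
outside_ratio = np.mean(new_dist > radius)
print(f"{outside_ratio:.1%} of the new batch lies outside the known clusters")
if outside_ratio > 0.05:                                  # illustrative threshold
    print("Possible novelty / concept drift")
```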
Multivariate distribution methods “store summarized information of the training data chunk [...], as the reference distribution, to monitor changes in the current data chunk” (Sethi and Kantardzic, 2017) [18].
Hellinger distance and KL-divergence are often used to monitor those changes. HDDDM belongs to this family of methods and uses the Hellinger distance to compare both batches.
The below plot illustrates how HDDDM (implemented in the GitHub package) works:
Again, those methods are powerful but limited to specific drifts: those that manifest themselves as deviations in the feature distributions.
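As a building block, here is a minimal sketch of the Hellinger distance between the histograms of one feature in two batches. HDDDM itself goes further: it averages the distance over all features and uses an adaptive threshold on its change between batches; this sketch only shows the distance:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (normalized histograms)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def feature_hellinger(reference, new, n_bins=20):
    # Shared bin edges so both histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([reference, new]), bins=n_bins)
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(new, bins=edges)[0] / len(new)
    return hellinger(p, q)

rng = np.random.default_rng(0)
print(feature_hellinger(rng.normal(0, 1, 5_000), rng.normal(0, 1, 5_000)))   # close to 0
print(feature_hellinger(rng.normal(0, 1, 5_000), rng.normal(1, 1, 5_000)))   # clearly larger
```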
Model-dependent methods, inspired by adversarial classification, “discern changes that could adversely affect the classification performance” (Sethi and Kantardzic, 2017) [18].
The Margin Density Drift Detection Method (MD3) monitors the change in the proportion of samples that fall inside the classifier’s margin, as the below plot shows. MD3 is implemented in the open-source package alibi-detect [17].
Graph extracted from the article “On the Reliable Detection of Concept Drift from Streaming Unlabeled Data” (Sethi and Kantardzic, 2017) [18] to illustrate adversarial classification methods.
Those methods have the advantage of focusing on drifts that impact the model’s performance, which leads to only a few false alarms.
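To illustrate the core idea behind MD3, here is a rough toy sketch with a linear SVM: the margin density is the fraction of samples falling inside the margin, and a large change in that fraction on a new, unlabeled batch is treated as a drift warning. The data, the threshold, and the single-batch comparison are simplifications of the real method, which tracks the margin density over time:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Reference data: two overlapping classes separated (noisily) by x1 + x2 = 0
X_train = rng.normal(0.0, 1.0, size=(2_000, 2))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(0, 0.5, 2_000) > 0).astype(int)

clf = LinearSVC(C=1.0).fit(X_train, y_train)

# Margin density = fraction of samples falling inside the SVM margin (|decision| <= 1)
ref_density = np.mean(np.abs(clf.decision_function(X_train)) <= 1.0)

# New, unlabeled batch that concentrates around the decision boundary
X_new = rng.normal(0.0, 0.4, size=(1_000, 2))
new_density = np.mean(np.abs(clf.decision_function(X_new)) <= 1.0)

print(f"reference margin density = {ref_density:.2f}, new batch = {new_density:.2f}")
if abs(new_density - ref_density) > 0.10:   # illustrative threshold
    print("Possible drift affecting the classifier's performance")
```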