Expect the Unexpected — Detecting Post Model Deployment Data Drifts

It’s best not to wait for others to tell you that your Machine Learning model is misbehaving.

Gabriel Tardochi Salles
5 min read · Sep 14, 2022
Photo of a yellow car drifting, by Vladyslav Lytvyshchenko on Unsplash.

We all love the moment we deploy our models and finally see them delivering actual value, don’t we? Sadly, there is no time to waste in post-deployment. Since machine learning relies on patterns from the past to generalize for the future, any underlying structural change in the data can mean disaster.

Rest assured that the world will keep on changing. There is probably a data-drifting black swan event waiting for us just around the corner. Thankfully, we have some pretty neat tricks to monitor our data!

Robust Drift Detection Methods

Given its relevance in the MLOps world, much research has been happening in this area, and NannyML has done excellent work researching and implementing several of these methods. Here, I want to present and explain some of the most commonly applied ideas, briefly discussing their advantages and disadvantages.

Univariate Drift Analysis

This approach involves looking at a single feature to identify whether its distribution has changed over time. The usual practice is to take a reference (ground truth) distribution and apply statistical comparisons against the distribution of the analyzed period, measuring the likelihood that both are samples of the same population.

Continuous variables

For continuous variables, one option is to apply the Kolmogorov-Smirnov (KS) test. Intuitively, it compares the cumulative distribution functions of the reference and analyzed data (the CDF is the curve showing, for each value x, the probability that the variable is ≤ x), taking the maximum absolute vertical distance between the two curves as the d-statistic.

Ideally, the d-statistic should be close to zero, meaning reference and analyzed data are similar. Given the obtained value, we should also calculate its p-value: the probability of observing a d-statistic at least as large as this one under the null hypothesis that both samples come from the same population. By convention, we reject the null hypothesis if the p-value is ≤ 0.05; in that case, we have found a data drift!

KS test between summer and fall temperature distributions. The d-statistic is 0.52, with p-value = 0.
We can conclude that these two distributions are not from the same population. Image by the author.
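
To make this concrete, here is a minimal sketch of the two-sample KS test with scipy; the summer and fall temperature samples are hypothetical stand-ins for the reference and analyzed windows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical daily temperatures: summer is the reference window,
# fall is the analyzed window.
summer_temps = rng.normal(loc=28, scale=3, size=500)
fall_temps = rng.normal(loc=18, scale=4, size=500)

# The two-sample Kolmogorov-Smirnov test compares the empirical CDFs
# and returns the maximum vertical distance between them (d-statistic).
d_statistic, p_value = stats.ks_2samp(summer_temps, fall_temps)

# p-value <= 0.05 -> reject the null hypothesis that both samples
# come from the same population, i.e. flag a data drift.
print(f"d={d_statistic:.3f}, p={p_value:.4f}, drift={p_value <= 0.05}")
```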

Categorical variables

For categorical variables, we often use the Chi-Squared test, or chi2 test, as it is sometimes called. Intuitively, given a contingency table, it tells us whether there is a statistically significant difference between the category frequencies of the reference and the analyzed data, by comparing the observed table against the expected frequencies table (built assuming both samples come from the same population).

Contingency table of summer vs. fall weather situations (top), with the expected frequencies (bottom). Image by author.

The bigger the chi-squared statistic, the more the two samples we are comparing differ. We should also inspect the p-value: the probability of observing a result at least as extreme as ours, given the chi2 distribution with our contingency table’s degrees of freedom. Once again, our null hypothesis is that both samples come from identical populations. If the p-value is less than 0.05, we reject the null hypothesis (data drift identified!).

Plot of the chi2 distribution for values of k (degrees of freedom). By Geek3 — Own work, CC BY 3.0.
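
As a sketch of the same idea, scipy’s chi2_contingency computes the statistic, the p-value, the degrees of freedom, and the expected frequencies table in a single call; the weather-situation counts below are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: counts of weather situations
# (clear, cloudy, rainy) in the reference (summer) vs. analyzed (fall) data.
observed = np.array([
    [520, 310, 70],   # summer
    [280, 390, 230],  # fall
])

# chi2_contingency builds the expected frequencies table under the null
# hypothesis that both samples share the same category distribution.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2={chi2_stat:.1f}, dof={dof}, p={p_value:.4f}")
print("Expected frequencies:\n", expected.round(1))

# p-value < 0.05 -> reject the null hypothesis -> data drift identified.
print("Drift detected:", p_value < 0.05)
```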

Univariate analysis considerations

Pros: It’s a straightforward method, fully explainable, and easy to communicate.

Cons: Since each feature is monitored separately, the number of tests grows with the feature count, raising the risk of false alerts. It also can’t detect changes in the relationships between features.

Multivariate Drift Analysis

This method searches for significant data changes by analyzing a set of features together over time. The main idea is to detect data drifts even when each feature’s distribution stays exactly the same but the interactions between features have changed.

Features A and B keep the exact same univariate distributions in 2020 and 2021, but their correlation flipped from 1 to -1. Image by author.

To achieve this, people will usually take the reference data and look for ways to learn its underlying structure. Analyzed data points can then be mapped to a latent space and reconstructed using what was learned from encoding the reference data points. We can do this by training an autoencoder neural network or by applying the classic principal component analysis (PCA) algorithm, for example. Here, I’ll explore the latter approach.

High-level illustration of an autoencoder for representation learning, with an encoding block, a bottleneck, and a decoding block. Image by author.

PCA Data Reconstruction Error

There are only three steps needed to measure data drifts with this method.

First, we should prepare our data for PCA. One way of doing this is to encode categorical features with their respective frequencies and standardize all features to zero mean and unit variance (ensuring that feature scale does not affect PCA, since it analyzes variance).

We then fit PCA to the reference dataset to learn how to best project each data point onto only the first few principal components, obtaining lower-dimensional data. It aims to perform this transformation while preserving as much of the data’s variation as possible (we could tune the desired preserved variation; NannyML defaults to 65%).

The last step is to measure the reconstruction error on the analyzed data. To achieve this, we apply the learned latent representation mapping followed by the inverse PCA transformation. The average distance (usually Euclidean) between the original and reconstructed data points is then called the reconstruction error.
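
Here is a minimal sketch of the three steps with numpy and scikit-learn; the helper names, the frequency-encoding choice, and the 65% preserved variance are illustrative assumptions, not NannyML’s actual implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


# Step 1: prepare the data (frequency-encode categorical features).
def frequency_encode(df, categorical_cols, frequency_maps=None):
    out = df.copy()
    if frequency_maps is None:
        # Learn the frequencies on the reference data and reuse them later.
        frequency_maps = {
            col: df[col].value_counts(normalize=True) for col in categorical_cols
        }
    for col in categorical_cols:
        out[col] = out[col].map(frequency_maps[col]).fillna(0.0)
    return out, frequency_maps


# Step 2: standardize and fit PCA on the reference data only.
def fit_reference(reference_numeric, preserved_variance=0.65):
    scaler = StandardScaler().fit(reference_numeric)
    # A float n_components keeps the fewest components explaining
    # at least that fraction of the variance (65% here).
    pca = PCA(n_components=preserved_variance).fit(scaler.transform(reference_numeric))
    return scaler, pca


# Step 3: project, reconstruct, and average the Euclidean distances.
def reconstruction_error(numeric_df, scaler, pca):
    scaled = scaler.transform(numeric_df)
    reconstructed = pca.inverse_transform(pca.transform(scaled))
    return float(np.linalg.norm(scaled - reconstructed, axis=1).mean())


# Hypothetical usage with two pandas DataFrames, reference_df and analyzed_df:
# ref_num, freq_maps = frequency_encode(reference_df, ["weather_situation"])
# ana_num, _ = frequency_encode(analyzed_df, ["weather_situation"], freq_maps)
# scaler, pca = fit_reference(ref_num)
# error = reconstruction_error(ana_num, scaler, pca)
```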

Reconstruction error for weather-related features when only January and February are used as reference data: the error increases in the following months, since the changing seasons change the structure of the dataset. The multivariate analysis could detect a change in the underlying data structure, and thus a data drift. Image by author.

Since there’s always noise in real-world datasets, it’s not uncommon to see some variability in the reconstruction error results. Still, we can estimate the average error and its standard deviation by applying this logic to various reference data splits. With that in hand, it’s easy to set well-informed data drift detection thresholds!
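
As a rough sketch reusing the helpers above, the reference data could be split into chunks to estimate the error’s mean and standard deviation; the chunk count and the three-standard-deviation alerting band are assumptions for illustration.

```python
import numpy as np


def drift_threshold(reference_numeric, scaler, pca, n_chunks=10, n_std=3.0):
    # Compute the reconstruction error on each reference chunk, then derive
    # an upper alerting threshold as mean + n_std * standard deviation.
    chunk_indices = np.array_split(np.arange(len(reference_numeric)), n_chunks)
    errors = [
        reconstruction_error(reference_numeric.iloc[idx], scaler, pca)
        for idx in chunk_indices
    ]
    return float(np.mean(errors) + n_std * np.std(errors))


# Any analyzed period whose reconstruction error exceeds this threshold
# would be flagged as a potential data drift.
```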

Multivariate analysis considerations

Pros: Reduces false alerts by providing a single summary number and detects changes in the underlying data structure that univariate approaches cannot see.

Cons: Once a data drift is detected, pinpointing exactly where it happened takes extra effort. It’s also not as explainable as the univariate approach.

Just like in most real-life situations, there is no silver bullet. The univariate and multivariate data drift detection approaches are valuable in complementary ways, and many implementations exist. Numerous early AI adopters will forget about drift and fail hard without even knowing it, but we should accept that data drift is real and design the means to react to it proactively. Put it in the equation and sidestep wasting months of hard work!

And that is it! Consider following me for more advanced data-related content. Also, connect with me on LinkedIn; let’s continue the discussion there!
