Expect the Unexpected — Detecting Post Model Deployment Data Drifts
It’s best not to wait for others to tell you that your Machine Learning model is misbehaving.
We all love the moment we deploy our models and finally see them delivering actual value, don’t we? Sadly, there is no time to waste in post-deployment. Since machine learning relies on patterns from the past to generalize for the future, any underlying structural change in the data can mean disaster.
Rest assured that the world will keep on changing. There is probably a data-drifting black swan event waiting for us just around the corner. Thankfully, we have some pretty neat tricks to monitor our data!
Robust Drift Detection Methods
Given its relevance in the MLOps world, much research has been happening in this area, and NannyML has done excellent work researching and implementing several of these methods. Here, I want to present and explain some of the most commonly applied ideas, briefly discussing their advantages and disadvantages.
Univariate Drift Analysis
This approach looks at a single feature to identify whether its distribution has changed over time. The usual practice is to take a reference (ground truth) distribution and apply statistical comparisons against the analyzed period's distribution, measuring the likelihood that both are samples of the same population.
Continuous variables
For continuous variables, one option is to apply the Kolmogorov-Smirnov test. Intuitively, it compares the cumulative distribution functions of the reference and analyzed data (the curve showing the probability of a variable being ≤ a specific value x), taking the maximum absolute vertical distance between them as the d-statistic.
Ideally, the d-statistic should be close to zero, meaning the reference and analyzed data are similar. Given the obtained value, we should also calculate its p-value: the likelihood of observing a d-statistic at least as large as ours, assuming both samples come from the same population (our null hypothesis). By convention, we reject the null hypothesis if the p-value is ≤ 0.05. In that case, we have found a data drift!
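As a minimal sketch of what this looks like in practice, here is a two-sample KS test using scipy. The synthetic samples and the 0.05 threshold are illustrative assumptions, not any particular library's workflow:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical reference and analysis samples for one continuous feature.
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
analysis = rng.normal(loc=0.3, scale=1.0, size=5_000)  # slightly shifted

# Two-sample Kolmogorov-Smirnov test: the d-statistic is the maximum
# absolute vertical distance between the two empirical CDFs.
d_statistic, p_value = stats.ks_2samp(reference, analysis)

if p_value <= 0.05:  # conventional significance level
    print(f"Drift detected (d={d_statistic:.3f}, p={p_value:.4f})")
else:
    print(f"No drift detected (d={d_statistic:.3f}, p={p_value:.4f})")
```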

Categorical variables
For categorical variables, we often use the Chi-Squared test, or chi2 test, as it is sometimes called. Intuitively, given a contingency table, it tells us whether there is a statistically significant difference between the category frequencies of the reference and analyzed data by comparing the observed frequencies against an expected frequencies table (built assuming both samples come from the same population).

The bigger the chi-squared statistic, the more the two samples we are comparing differ. We should also inspect the p-value: the likelihood of a result at least as extreme as the one observed, given the chi2 distribution with our contingency table's degrees of freedom. Once again, our null hypothesis is that both samples come from identical populations. If the p-value is less than 0.05, we reject the null hypothesis (data drift identified!).
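Here's a hedged sketch of the same idea with scipy's chi2_contingency; the category counts below are made up for illustration:

```python
import pandas as pd
from scipy import stats

# Hypothetical category observations for one feature in each period.
reference = pd.Series(["A"] * 700 + ["B"] * 200 + ["C"] * 100)
analysis = pd.Series(["A"] * 500 + ["B"] * 350 + ["C"] * 150)

# Build the 2 x k contingency table: one row per period, one column per category.
contingency = pd.DataFrame({
    "reference": reference.value_counts(),
    "analysis": analysis.value_counts(),
}).fillna(0).T

# The test compares observed frequencies against the expected frequencies
# computed under the null hypothesis of a single shared population.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency)

if p_value < 0.05:
    print(f"Drift detected (chi2={chi2_stat:.1f}, p={p_value:.4f}, dof={dof})")
else:
    print(f"No drift detected (chi2={chi2_stat:.1f}, p={p_value:.4f})")
```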

Univariate analysis considerations
Pros: It’s a straightforward method, fully explainable, and easy to communicate.
Cons: High risk of generating false alerts, since each feature is monitored independently (a multiple-comparisons problem). It also can't detect changes in the relationships between features.
Multivariate Drift Analysis
This method searches for significant data changes by analyzing a set of features together over time. The main idea is to detect data drifts even when individual feature distributions remain exactly the same but the interactions between features have changed.

To achieve this, we usually take the reference data and learn its underlying structure. Analyzed data points can then be mapped to a latent space and reconstructed, given what was learned from encoding the reference data points. We can do this by training an autoencoder neural network or by applying the classic principal component analysis (PCA) algorithm, for example. Here, I'll explore the latter approach.

PCA Data Reconstruction Error
There are only three steps needed to measure data drifts with this method.
First, we should prepare our data for PCA. One way of doing this is to encode categorical features with their respective frequencies and standardize all features to zero mean and unit variance (ensuring feature scale does not affect PCA, since it analyzes variance).
We then fit PCA to the reference dataset to learn how to best project each data point onto the first few principal components, obtaining lower-dimensional data. The transformation aims to preserve as much of the data's variation as possible (we can tune the desired preserved variation; NannyML defaults to 65%).
The last step is to measure the reconstruction error on the analyzed data. To achieve this, we apply the learned latent representation mapping followed by the inverse PCA transformation. The average distance (usually Euclidean) between the original and reconstructed data points is called the reconstruction error.
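To make the three steps concrete, here is a minimal sketch using scikit-learn. The synthetic data and the 65% variance setting mirror the description above, but this is an illustration rather than NannyML's actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(pca, scaler, data):
    """Mean Euclidean distance between points and their PCA reconstructions."""
    scaled = scaler.transform(data)
    reconstructed = pca.inverse_transform(pca.transform(scaled))
    return np.linalg.norm(scaled - reconstructed, axis=1).mean()

# Hypothetical numeric feature matrices (categoricals already frequency-encoded).
rng = np.random.default_rng(0)
reference = rng.normal(size=(5_000, 10))
analysis = reference @ rng.normal(size=(10, 10))  # feature interactions changed

# Steps 1 and 2: standardize, then fit PCA on the reference data only,
# keeping enough components to preserve ~65% of the variance.
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=0.65).fit(scaler.transform(reference))

# Step 3: reconstruction error on the analyzed data.
print(f"Reference error: {reconstruction_error(pca, scaler, reference):.3f}")
print(f"Analysis error:  {reconstruction_error(pca, scaler, analysis):.3f}")
```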

Since there's always noise in real-world datasets, it's not uncommon to see some variability in the reconstruction error. Still, we can estimate the average error and its standard deviation by applying this logic to several reference data splits. With that in hand, it's easy to set well-informed data drift detection thresholds!
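Continuing the sketch above, one simple (assumed, not library-prescribed) way to set those thresholds is to flag any error beyond three standard deviations of the mean reconstruction error across reference splits:

```python
# Estimate drift thresholds from reconstruction errors on reference data splits.
splits = np.array_split(reference, 10)
errors = [reconstruction_error(pca, scaler, split) for split in splits]

mean_error, std_error = np.mean(errors), np.std(errors)
lower, upper = mean_error - 3 * std_error, mean_error + 3 * std_error

analysis_error = reconstruction_error(pca, scaler, analysis)
if not lower <= analysis_error <= upper:
    print(f"Drift detected: error {analysis_error:.3f} outside "
          f"[{lower:.3f}, {upper:.3f}]")
```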
Multivariate analysis considerations
Pros: Reduces false alerts by providing a single summary number and detects changes in the underlying data structure that univariate approaches cannot see.
Cons: Once a data drift is detected, pinpointing exactly where it happened takes extra effort. It's also not as explainable as the univariate approach.
Just like in most real-life situations, there is no silver bullet. Univariate and multivariate data drift detection are a blessing in their complementary ways, and many implementations exist. Numerous early AI adopters will overlook data drift and fail hard without even knowing it, but we should recognize that it is real and design proactive ways to react to it. Put it in the equation and avoid wasting months of hard work!
And that is it! Consider following me for more advanced data-related content. Also, connect with me on LinkedIn; let's continue the discussion there!