Handling the Challenge of Model Drift in Data Lakes
Written by Dr. Nathan Green, Marymount University and Oliver Forbes, NTT DATA
How information is shared and consumed is constantly evolving, and machine learning has become a vital analytical tool for interpreting it. As machine learning models are deployed in cloud environments, companies must monitor the performance of those models over time, whether they are trained in an offline or online environment. The ability to trust a model's output rests on the assumption that the data it sees still resembles the data it was trained on; security and monitoring ultimately determine whether that assumption holds or whether the model is quietly scoring data that has drifted. Model degradation is a common scenario, and it is very often caused by model drift. But how do we define "drift"? We define model drift as a combination of data drift and concept drift.
Data drift is a phenomenon that can occur when your independent variables, the machine learning input features, change from their original baseline. This change can happen for a number of reasons, ranging from faulty equipment and covariate shift to natural drift over time or a change in the meaning of an independent variable.
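As a concrete illustration of what "changing from a baseline" means, the sketch below compares a new batch of feature values against the training-time baseline and flags drift when the batch mean moves too far from the baseline mean. The function name `mean_shift_check` and the three-standard-error threshold are illustrative choices, not a prescribed method; production systems typically use richer distribution tests.

```python
import numpy as np

def mean_shift_check(baseline, current, threshold=3.0):
    """Flag data drift when the current batch mean sits more than
    `threshold` standard errors away from the baseline mean."""
    base_mean = np.mean(baseline)
    base_std = np.std(baseline, ddof=1)
    stderr = base_std / np.sqrt(len(current))
    z = abs(np.mean(current) - base_mean) / stderr
    return z > threshold

# Deterministic toy data: a training-time feature, a stable new batch,
# and a batch whose values have shifted upward (e.g. a covariate shift).
baseline = np.linspace(-3, 3, 5000)
stable = np.linspace(-3, 3, 500)
drifted = np.linspace(-3, 3, 500) + 0.8

print(mean_shift_check(baseline, stable))   # False: same distribution
print(mean_shift_check(baseline, drifted))  # True: mean has shifted
```

A check like this catches sudden shifts (such as a faulty sensor) but not slow semantic change, which is why the approaches discussed below are still needed.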
Concept drift, in contrast to data drift, is often outside of institutional control. In concept drift your dependent variable, the target your model is predicting, changes meaning. This can be seen in recent research on vaccine hesitancy: in a model that labeled data as pro-vaccine or anti-vaccine, the concept of "anti-vaccine" likely drifted from certain populations into much broader arguments during the COVID-19 pandemic. These changes in concept would not necessarily be picked up by models trained on past vaccine debates.
Research into model drift, encompassing both data drift and concept drift, has led to three general approaches to combating model degradation.
1. Interval-based retraining:
Often the simplest solutions are effective in machine learning. If drift happens naturally over time, it can be handled through simple interval-based retraining. If you see natural degradation of your model every 10 weeks, then retraining every 8 weeks might be the most cost-effective solution. While effective for the most common forms of drift, this may not be appropriate for models in critical application areas, where faulty equipment or a change in meaning might go unnoticed for 8 weeks.
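The interval check itself is trivially automatable. The minimal sketch below uses the 8-week cadence from the example above; `needs_retraining` is an illustrative helper, and the actual retraining job it would trigger is left out.

```python
import datetime

# Retrain on an 8-week cadence, ahead of the ~10-week degradation
# observed in monitoring (the cadence is an assumed example value).
RETRAIN_INTERVAL = datetime.timedelta(weeks=8)

def needs_retraining(last_trained, now=None):
    """Return True once the fixed retraining interval has elapsed."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now - last_trained >= RETRAIN_INTERVAL

last_trained = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
feb = datetime.datetime(2024, 2, 1, tzinfo=datetime.timezone.utc)
mar = datetime.datetime(2024, 3, 1, tzinfo=datetime.timezone.utc)

print(needs_retraining(last_trained, feb))  # False: ~4.4 weeks elapsed
print(needs_retraining(last_trained, mar))  # True: ~8.6 weeks elapsed
```

The simplicity is the point: a scheduled job plus regression tests keeps this approach nearly hands-off, which is exactly the trade-off discussed later.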
2. Statistical distributions:
Tests such as the Kolmogorov-Smirnov test have been shown to be useful for detecting out-of-domain data or a significant change in the distribution of your target data. Azure, for example, allows you to compare a baseline dataset against your current target dataset and track drift over time.
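To make the Kolmogorov-Smirnov idea concrete, the sketch below computes the two-sample KS statistic by hand: the largest gap between the empirical CDFs of the baseline and current samples. In practice you would likely reach for `scipy.stats.ks_2samp`, which also returns a p-value; the hand-rolled version and the 0.1 decision threshold here are purely illustrative.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    all_vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_vals, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Deterministic toy data: a baseline feature, a current batch from the
# same distribution, and a batch shifted upward by 0.3.
baseline = np.linspace(0.0, 1.0, 1000)
current_same = np.linspace(0.0, 1.0, 400)
current_shifted = np.linspace(0.3, 1.3, 400)

print(ks_statistic(baseline, current_same))     # near 0: no drift
print(ks_statistic(baseline, current_shifted))  # ~0.3: flag for review
```

Unlike a simple mean check, the KS statistic is sensitive to any change in distribution shape, which is what makes it a common choice for baseline-versus-target drift monitors.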
3. Ensemble Modeling:
This is a lesser-known approach that has been used to reduce the impact of model drift, and it has attributes that are appealing to a security audience dealing with models in a critical application area. The general idea is to separate a multi-label classification model into multiple models that each predict a single label. With this setup, concept drift can be singled out for one particular target and retrained while the rest of the model keeps functioning as normal. This is likely a good approach when the model consumes large amounts of social media data and targets a trending topic.
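The decomposition described above can be sketched as a thin wrapper that holds one binary model per label, so a single drifted label can be retrained in isolation. The class name `PerLabelEnsemble`, the keyword-matching `train_fn`, and the pro-/anti-vaccine toy data are all illustrative stand-ins for real models and training pipelines.

```python
class PerLabelEnsemble:
    """Decompose a multi-label classifier into one binary model per
    label, so a drifted label can be retrained without touching the
    models that are still performing well."""

    def __init__(self, labels, train_fn):
        self.train_fn = train_fn              # train_fn(label, data) -> model
        self.models = {label: None for label in labels}

    def fit_all(self, data):
        for label in self.models:
            self.retrain(label, data)

    def retrain(self, label, data):
        # Only this label's model is rebuilt; the rest keep serving.
        self.models[label] = self.train_fn(label, data)

    def predict(self, text):
        # Union of the labels whose binary model fires on the input.
        return {label for label, m in self.models.items() if m(text)}

def train_fn(label, data):
    """Toy 'training': remember the words seen in texts carrying this
    label, and fire when any of them appears in new input."""
    keywords = {w for text, labels in data if label in labels
                for w in text.split()}
    return lambda text: any(w in keywords for w in text.split())

data = [("vaccines save lives", {"pro"}),
        ("vaccine mandates bad", {"anti"})]
ensemble = PerLabelEnsemble({"pro", "anti"}, train_fn)
ensemble.fit_all(data)
print(ensemble.predict("vaccines save lives"))  # {'pro'}

# If the "anti" concept drifts, retrain only that model with fresh data:
ensemble.retrain("anti", data + [("broader freedom arguments", {"anti"})])
```

The security appeal is the isolation: a retrain triggered by one drifting concept cannot regress the behavior of the other labels, which keeps the blast radius of each update small.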
All three methods can work with proper monitoring, but they differ greatly in the expertise needed to maintain a secure system. Retraining can be largely automated and relatively hands-off with proper regression testing. Statistical distribution tests generally need to be reviewed periodically by a data scientist, who analyzes performance and decides whether to re-engineer or retrain a model. Ensemble modeling requires the most technical intervention but adds a robustness that may be desirable for critical applications.
The approach is clear: whether you adopt interval-based retraining, statistical distribution tests, or ensemble modeling as part of a security-oriented, data-driven focus, the fundamental step of monitoring the performance of your models over time provides the launch point in the race against drift.
We invite you to join the CSA Artificial Intelligence Circle Community to better understand and contribute to the security techniques against data drift.