Machine Learning: Data Drift and Concept Drift

This article explores the concepts of data drift and concept drift in machine learning models. It discusses the challenges posed by these drifts and provides strategies for monitoring and mitigating their effects.

Rahul S



Data drift refers to changes over time in the statistical distribution of a model's input data, relative to the data the model was trained on.

This can happen due to various factors such as changes in the data sources, environmental factors, or user behavior.

CHANGE IN DATA SOURCE: Consider a machine learning model that predicts customer churn based on historical data. If the source of the data changes, for example because a new system is introduced to collect customer information, the data the model was trained on may no longer represent the data it now receives. This mismatch between training data and current input data is data drift.

ENVIRONMENTAL FACTORS: Data drift can also occur when there are changes in the environment in which the data is collected. For instance, if the data is collected from sensors in a manufacturing plant and the plant undergoes renovations or upgrades, the sensor readings may be affected, resulting in changes in the input data.

USER BEHAVIOR: Changes in user behavior, such as new trends or preferences, can also cause data drift. In an e-commerce setting, for example, customer behavior and preferences may change over time, so the input data the model receives no longer matches the data it was trained on.


Several techniques can be used to detect data drift. Here are some commonly used approaches:

  1. Statistical tests: Statistical tests and distance measures can detect changes in the distribution of input data. The Kolmogorov-Smirnov test is a commonly used statistical test, and the Jensen-Shannon divergence is a commonly used measure of distance between two distributions. Both compare the distribution of the current data with that of the reference (historical) data. A statistically significant test result, or a divergence above a chosen threshold, indicates the presence of data drift.
  2. Model performance metrics
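
The statistical comparison described in item 1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production monitor: the function names, the 20-bin histogram, the 0.05 significance level, and the synthetic Gaussian data are all assumptions made for the example. In practice, libraries such as SciPy provide `scipy.stats.ks_2samp` and `scipy.spatial.distance.jensenshannon` (the latter returns the square root of the divergence).

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    n, m = len(ref_sorted), len(cur_sorted)
    max_gap = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n
        cdf_cur = bisect_right(cur_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the KS rejection threshold at level alpha."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def js_divergence(reference, current, bins=20):
    """Jensen-Shannon divergence (log base 2, so within [0, 1]) of two histograms."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    p, q = histogram(reference), histogram(current)
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def detect_drift(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the approximate critical value."""
    d = ks_statistic(reference, current)
    threshold = ks_critical_value(len(reference), len(current), alpha)
    return {
        "ks_statistic": d,
        "ks_threshold": threshold,
        "drift": d > threshold,
        "js_divergence": js_divergence(reference, current),
    }


# Hypothetical data: a reference feature and a current batch whose mean has shifted.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

print(detect_drift(reference, shifted))
```

The KS test gives a yes/no decision at a chosen significance level, while the Jensen-Shannon divergence gives a bounded score that can be tracked over time; the threshold at which that score counts as drift has to be chosen for each application.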