Machine Learning: Data Leakage

Data leakage in machine learning can result in unreliable predictions and inaccurate model evaluations. Preventing leakage through manual review, preprocessing pipelines, and cross-validation improves model reliability and generalization.

Rahul S


Data leakage in machine learning refers to the situation where information from the test dataset is inadvertently used during training.

This inflates performance metrics during evaluation while production performance remains poor. The model learns patterns from test data it should never have seen, so its evaluation is misleading and its predictions are unreliable.
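A common way this happens in practice is fitting a preprocessing step on the full dataset before splitting. The sketch below (synthetic data, scikit-learn) contrasts the leaky pattern with the safe one, where the scaler is fit on the training rows only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler is fit on all rows, so test-set statistics
# influence how the training data is transformed.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Safe: split first, then fit the scaler on the training split only
# and apply the fitted transform to both splits.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr2)
X_tr2, X_te2 = scaler.transform(X_tr2), scaler.transform(X_te2)
```

Wrapping the scaler and model in a scikit-learn `Pipeline` enforces the safe ordering automatically, including inside cross-validation.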


1. Leakage in the Target Variable: a feature contains information that would not be available at the time of prediction.

This can result in the model learning from the target variable or derived proxies, leading to overfitting and poor generalization.

To illustrate target leakage, let’s consider a scenario where we want to predict customer churn in a subscription service. We might think that including the “number of customer service calls” as a feature would be informative since more calls could indicate a higher likelihood of churn.

But a high call count is a consequence of the decision to churn, not a cause of it: customers who have already decided to cancel call to settle outstanding issues before closing their subscriptions. The feature therefore encodes the outcome rather than predicting it.

Including the target variable among the features, or any proxy directly or indirectly derived from it, causes data leakage.
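The effect is easy to demonstrate on synthetic data: a feature that is just a noisy copy of the target produces a near-perfect cross-validation score, even when the genuine features carry no signal at all. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = rng.integers(0, 2, size=500)  # target is pure noise w.r.t. X

# Add a proxy feature derived from the target (the leak).
leaky = np.column_stack([X, y + rng.normal(scale=0.01, size=500)])

honest_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
leaky_score = cross_val_score(LogisticRegression(), leaky, y, cv=5).mean()
# honest_score sits near 0.5 (chance); leaky_score approaches 1.0
```

The leaky score looks excellent precisely because the model is reading the answer off one of its inputs, which is unavailable in production.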

To prevent target leakage, it is crucial to carefully review the relationship between features and the target variable.

Conducting exploratory data analysis (EDA) can help uncover correlations between features and targets. High correlations should raise…
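One simple EDA check along these lines is to compute each feature's correlation with the target and flag suspiciously high values. A sketch with pandas (all column names and the 0.9 threshold are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "tenure_months": rng.normal(24, 6, 300),
    "monthly_fee": rng.normal(50, 10, 300),
})
df["churn"] = rng.integers(0, 2, size=300)
# Proxy feature derived from the target -- a leakage suspect.
df["cancellation_calls"] = df["churn"] * 3 + rng.normal(0, 0.1, 300)

# Absolute correlation of every feature with the target.
corr = df.corr()["churn"].drop("churn").abs()
suspects = corr[corr > 0.9].index.tolist()
print(suspects)  # the proxy feature should stand out
```

A near-perfect correlation does not prove leakage on its own, but it is a strong signal that the feature's provenance deserves a manual review.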