Data leakage in machine learning occurs when information from the test dataset is mistakenly included in the training dataset.
This artificially inflates performance metrics during development, while performance in production turns out to be poor. The model learns patterns from test data that it should not have access to, so its evaluation is inaccurate and its predictions are unreliable.
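As a minimal sketch of what such contamination can look like, the snippet below checks whether any identical rows appear in both splits. The records and the helper name `find_overlap` are hypothetical, and exact duplicates are only the most obvious form of overlap.

```python
def find_overlap(train_rows, test_rows):
    """Return the set of rows that occur in both splits."""
    # Rows are converted to tuples so they can be hashed and compared exactly.
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


# Hypothetical (customer_id, monthly_fee) records.
train_rows = [(1, 10.0), (2, 20.0), (3, 15.0), (4, 30.0)]
test_rows = [(4, 30.0), (5, 25.0)]

overlap = find_overlap(train_rows, test_rows)
print(f"{len(overlap)} row(s) appear in both splits: {sorted(overlap)}")
```

A non-empty result means the test score is partly measuring memorization of rows the model has already seen.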
TYPES OF DATA LEAKAGE
1. Leakage in Target Variable: Information that would not be available at the time of prediction is included as a feature in the model.
This can result in the model learning from the target variable or derived proxies, leading to overfitting and poor generalization.
To illustrate target leakage, let’s consider a scenario where we want to predict customer churn in a subscription service. We might think that including the “number of customer service calls” as a feature would be informative since more calls could indicate a higher likelihood of churn.
But in this scenario the calls are a consequence of customers having already decided to churn: such customers call merely to settle outstanding issues before canceling their subscriptions. The feature is therefore a result of churn, not a contributing factor.
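One possible safeguard for the churn example is to build the feature only from calls logged before the moment the prediction would be made. The sketch below assumes a hypothetical call log and prediction date; `calls_before_cutoff` is an illustrative helper, not part of any library.

```python
from datetime import date


def calls_before_cutoff(call_dates, cutoff):
    """Count only the calls logged strictly before the prediction cutoff."""
    return sum(1 for d in call_dates if d < cutoff)


# Hypothetical call log for one customer who later cancels.
call_dates = [date(2024, 1, 5), date(2024, 2, 11), date(2024, 3, 18), date(2024, 3, 19)]
prediction_date = date(2024, 3, 1)  # assumed day the model is asked to predict

leaky_feature = len(call_dates)  # also counts calls made after the prediction date
safe_feature = calls_before_cutoff(call_dates, prediction_date)
print(leaky_feature, safe_feature)
```

The leaky count includes the calls made just before cancellation, which would not be known on the prediction date; the cutoff version uses only what was available at that time.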
Including the target variable among the features, or any proxy derived directly or indirectly from it, can lead to data leakage.
To prevent target leakage, it is crucial to carefully review the relationship between features and the target variable.
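One simple way to start such a review is to measure how strongly each feature correlates with the target and flag unusually high values for a closer look. The sketch below uses a hand-written Pearson correlation on a tiny hypothetical churn dataset; the feature names, the data, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: 1 means the customer churned.
churned = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "monthly_fee": [10, 20, 10, 20, 10, 20, 10, 20],
    "service_calls": [0, 1, 0, 1, 3, 4, 3, 5],
}
SUSPICION_THRESHOLD = 0.9  # assumed cutoff; tune it for your own data

correlations = {name: pearson(values, churned) for name, values in features.items()}
suspicious = [name for name, r in correlations.items() if abs(r) > SUSPICION_THRESHOLD]
print(suspicious)
```

A flagged feature is not proof of leakage. It is a prompt to ask whether that feature would really be known at prediction time.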
Conducting exploratory data analysis (EDA) can help uncover correlations between features and targets. High correlations should raise…