
Data leakage

Data leakage means that the data used to train a machine learning algorithm includes the information you are trying to predict. The model can then perform better in training than it would in the real world, giving a false sense of how well it performs. Learn how to identify and prevent data leakage to get reliable predictions.

There are two forms of data leakage:

  • When one or more features in the training set can be used to derive the target variable you are trying to predict.

  • When one or more features in the training set include information that would not be known at the time of prediction.

In the following table, the column Stage duplicates the information in Stage (Binary), the column that we want to predict. Including Stage in the training dataset would hand the model the answer, leading to an artificially high score for our model.

Table with the "leaky column" Stage that contains information about the target column Stage (Binary)

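One way to catch this kind of leak before training is to check whether any single column fully determines the target. The following is a minimal sketch in plain Python and is not part of the product. The function name `find_target_proxies` and the sample rows are hypothetical and made up for illustration.

```python
from collections import defaultdict


def find_target_proxies(rows, target):
    """Return features whose values each map to exactly one target value.

    Such a feature can be used to derive the target and is a leakage suspect.
    Columns with a unique value per row (such as IDs) also match, so review
    every hit manually.
    """
    seen = defaultdict(lambda: defaultdict(set))  # feature -> value -> targets
    for row in rows:
        for feature, value in row.items():
            if feature != target:
                seen[feature][value].add(row[target])
    return sorted(
        feature
        for feature, mapping in seen.items()
        if all(len(targets) == 1 for targets in mapping.values())
    )


# Hypothetical sample rows, made up for illustration.
rows = [
    {"Stage": "Closed Won", "Region": "East", "Stage (Binary)": 1},
    {"Stage": "Closed Lost", "Region": "East", "Stage (Binary)": 0},
    {"Stage": "Closed Won", "Region": "West", "Stage (Binary)": 1},
    {"Stage": "Closed Lost", "Region": "West", "Stage (Binary)": 0},
]

suspects = find_target_proxies(rows, target="Stage (Binary)")
print(suspects)  # ['Stage']
```

Region is not flagged because the same region appears with both outcomes, while every Stage value corresponds to a single outcome.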

Identifying data leakage

To identify data leakage, ask questions such as "Will you have the same information for a record at the time you want to make a prediction?" or "Will the record be the same in 30 days?" Remember that all data in your training dataset must fit the time constraint in your business question.
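If your source system records when each field was filled in, the first question can be checked mechanically. The sketch below assumes such per-field dates exist, which is often not the case. The function name `features_unknown_at` and all field names and dates are hypothetical.

```python
from datetime import date


def features_unknown_at(prediction_date, recorded_on):
    """Return features whose values were recorded after the prediction date."""
    return sorted(
        name
        for name, recorded in recorded_on.items()
        if recorded > prediction_date
    )


# Hypothetical dates on which each field of one record was filled in.
recorded_on = {
    "Region": date(2024, 1, 5),
    "Lead source": date(2024, 1, 5),
    "Close date": date(2024, 3, 20),
    "Invoice details": date(2024, 3, 22),
}

late = features_unknown_at(date(2024, 2, 1), recorded_on)
print(late)  # ['Close date', 'Invoice details']
```

Any feature in the result would not be available when the prediction is made, so it is a candidate for exclusion from training.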

When you have trained a model, you can look for the following clues in the model metrics.

  • High scores: Is the score suspiciously high? For example, is the F1 score above 0.85 (85%)?

  • Feature importance: Is one feature a lot more important than everything else?

  • Holdout score: Is the holdout score much lower than the cross-validation score?
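The three clues above can be sketched as a simple check. This is an illustration under stated assumptions, not a rule used by the product: scores are on a 0 to 1 scale, the F1 limit of 0.85 mirrors the example above, and the `dominance_ratio` and `max_gap` thresholds are hypothetical values that you would tune for your own data.

```python
def leakage_clues(f1, importances, cv_score, holdout_score,
                  f1_limit=0.85, dominance_ratio=3.0, max_gap=0.10):
    """Return the names of the leakage clues that the metrics trigger.

    dominance_ratio and max_gap are assumed thresholds, not product defaults.
    """
    clues = []
    if f1 > f1_limit:
        clues.append("high score")
    ranked = sorted(importances.values(), reverse=True)
    # Flag a top feature that outweighs the runner-up by a wide margin.
    if len(ranked) >= 2 and ranked[0] > dominance_ratio * ranked[1]:
        clues.append("dominant feature")
    # Flag a holdout score that falls well below the cross-validation score.
    if cv_score - holdout_score > max_gap:
        clues.append("holdout gap")
    return clues


# Hypothetical metrics for a model trained with a leaky column.
leaky = leakage_clues(
    f1=0.97,
    importances={"Stage": 0.90, "Region": 0.06, "Amount": 0.04},
    cv_score=0.97,
    holdout_score=0.70,
)
print(leaky)  # ['high score', 'dominant feature', 'holdout gap']
```

None of these clues proves leakage on its own. A triggered clue is a reason to review the features, not an automatic verdict.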

The following table shows examples of common features that might cause data leakage.

| Business use case | Target | Potentially leaky features |
| --- | --- | --- |
| Will a sales opportunity close? | Close (Yes or No) | Stage, close date, invoice details, commissions paid |
| Predict a future transaction amount | Amount of the next transaction | Taxes, order details |
| Will a lead convert to an opportunity? | Convert (Yes or No) | Opportunity details, conversion date |
| Will a customer churn? | Churn (Yes or No) | Churn reason, churn date, static customer tenure, customer temperature |
| Will an employee voluntarily terminate? | Terminate (Yes or No) | Exit interview details, termination date, resignation letter information |

Preventing data leakage

The best way to prevent data leakage is to use the structured framework to get a good business question and dataset. For more information, see Defining machine learning questions.

Tip: If you have identified a leaky column that should not be used in model training, you can still keep it in the dataset. Just exclude this feature from the training data in your machine learning experiment.
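If you prepare training data in code outside the experiment, the same idea can be sketched as follows. This is a minimal plain-Python illustration: the function name `split_for_training` and the sample rows are hypothetical.

```python
def split_for_training(rows, target, excluded):
    """Build feature dicts and target values, skipping excluded columns.

    The input rows are not modified, so leaky columns stay in the dataset.
    """
    skip = set(excluded) | {target}
    features = [
        {name: value for name, value in row.items() if name not in skip}
        for row in rows
    ]
    targets = [row[target] for row in rows]
    return features, targets


# Hypothetical dataset that still contains the leaky column Stage.
dataset = [
    {"Stage": "Closed Won", "Region": "East", "Stage (Binary)": 1},
    {"Stage": "Closed Lost", "Region": "West", "Stage (Binary)": 0},
]

X, y = split_for_training(dataset, target="Stage (Binary)", excluded=["Stage"])
print(X)  # [{'Region': 'East'}, {'Region': 'West'}]
print(y)  # [1, 0]
```

The leaky column remains available in `dataset` for reporting or analysis, while the model only sees the remaining features.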