Data Leakage in Machine Learning
How to prevent issues that reduce the quality of your models and/or cause inconsistent results
Introduction
When training a machine learning model, we normally aim for the model that scores the highest on some metric, such as accuracy. Naturally, then, when we train a model that appears to score very well on our validation or test data-set, we select it as a well-performing model and productionize/finalize it.
However, have you ever encountered a situation in which a model performs well during testing, but fails to achieve the same level of performance during real-world usage? For example, has your model reached 99% accuracy during testing, only to fall far short of that level as soon as it is productionized and acts on real data?
Such a discrepancy between test performance and real-world performance is often explained by a phenomenon called data leakage.
Data Leakage
Data leakage refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training data-sets. Typically, when splitting a data-set into testing and training sets, the goal is to ensure that no data is shared between the two. This is because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we do have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.
Data leakage often results in unrealistically-high levels of performance on the test set, because the model is being run on data that it had already seen — in some capacity — in the training set. The model effectively memorizes the training set data, and is easily able to correctly output the labels/values for those test data-set examples. Clearly, this is not ideal, as it misleads the person evaluating the model. When such a model is then used on truly unseen data, performance will be much lower than expected.
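To make the setup concrete, a standard leak-free split looks something like the sketch below. It uses scikit-learn’s train_test_split; the arrays X and y are placeholder data, not taken from any particular project.

```python
# Minimal sketch of a standard train/test split with scikit-learn.
# X and y are dummy placeholders for your own features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # 1,000 examples, 10 features (dummy data)
y = np.random.randint(0, 2, size=1000)  # binary labels (dummy data)

# Hold out 20% of the data; the test set should never influence training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```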
Causes of Data Leakage
Now I will mention some common causes of data leakage. It is important to avoid these situations when training models of your own. In general, you should avoid doing anything to your training set that involves having knowledge of the test set.
Pre-processing
A very common error that people make is to leak information in the data pre-processing step of machine learning. It is essential that these transformations only have knowledge of the training set, even though they are applied to the test set as well. For example, if you decide that you want to run PCA as a pre-processing step, you should fit your PCA model on only the training set. Then, to apply it to your test set, you would only call its transform method (in the case of a scikit-learn model) on the test set. If, instead, you fit your pre-processor on the entire data-set, you will leak information from the test set, since the parameters of the pre-processing model will be fitted with knowledge of the test set.
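A minimal sketch of the leak-free version, using scikit-learn’s PCA; X_train and X_test are placeholder arrays standing in for your own split data.

```python
# Sketch: fit PCA on the training set only, then apply the same fitted
# transformation to the test set. X_train and X_test are dummy placeholders.
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(800, 10)  # dummy training features
X_test = np.random.rand(200, 10)   # dummy test features

pca = PCA(n_components=5)

# Correct: learn the principal components from the training data only...
X_train_reduced = pca.fit_transform(X_train)
# ...then re-use those fitted components to transform the test data.
X_test_reduced = pca.transform(X_test)

# Leaky (avoid): fitting on the combined data lets test-set statistics
# influence the components, e.g. pca.fit(np.vstack([X_train, X_test]))
```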
Duplicates
Another error, which is especially common when your data-set comes from noisy, real-world data, is data duplication. This occurs when your data-set contains several points with identical or near-identical data. For example, if your data-set contained user messages on a messaging platform, duplicates may correspond to spammers who send the same message to many users. In this situation, you may experience data leakage simply because your train and test sets can contain the same underlying data point, even though it appears as two different observations. This can be fixed by de-duplicating your data-set prior to splitting into train and test sets. You can do this either by removing exact duplicates, or by using a fuzzy matching method (such as edit distance for text data) to remove approximate matches.
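One possible way to remove exact duplicates before splitting, sketched with pandas; the "message" column and the toy data are purely illustrative.

```python
# Sketch: de-duplicate before the train/test split, assuming the data lives
# in a pandas DataFrame with a "message" text column (hypothetical schema).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "message": ["hello", "buy now!!!", "buy now!!!", "see you later"],
    "label":   [0, 1, 1, 0],
})

# Drop rows whose message text is identical, keeping the first occurrence.
df = df.drop_duplicates(subset="message", keep="first")

# Only split after de-duplication, so the same message cannot end up
# in both the training and the test set.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
```

For near-duplicates, you would replace the exact-match step with a fuzzy comparison (for example, grouping messages whose edit distance falls below some threshold) before splitting.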
Temporal Data (implicit leakage)
Even when you are not explicitly leaking information, you may still experience data leakage if there are dependencies between your train and test sets. A common example of this occurs with temporal data, which is data where time is a relevant factor, such as time-series data. Consider the following toy example: your training set consists of two data-points, A and C, and your test set consists of one data-point, B. Now, suppose that the temporal ordering of these data-points is A → B → C. Here, we have most likely introduced data leakage simply through the way we created our training and test sets. By training on point C and testing on point B, we created an unrealistic situation in which we train our model on knowledge from the future, relative to the test set’s point in time. Therefore, we have leaked information, as in a real-world scenario our model clearly would not have any knowledge of the future. To fix this problem, you should ensure that your train-test split is also split across time: everything in your training set should occur before everything in your test set. This creates a much more realistic training situation, and allows you to properly evaluate your model as if it were acting on incoming real-world data.
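A rough sketch of such a time-based split, assuming the data sits in a pandas DataFrame with a "timestamp" column (the column name, dates, and cutoff are illustrative, not from the original example).

```python
# Sketch: split temporal data by time, so that everything in the training
# set occurs strictly before everything in the test set.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]
    ),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Sort chronologically, then cut at a fixed point in time.
df = df.sort_values("timestamp")
cutoff = pd.Timestamp("2021-03-01")

train_df = df[df["timestamp"] < cutoff]   # the past: used for training
test_df = df[df["timestamp"] >= cutoff]   # the future: used for testing
```

If you also need cross-validation on temporal data, scikit-learn’s TimeSeriesSplit enforces the same "train on the past, test on the future" ordering across folds.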