
Data Leakage in Machine Learning

How to prevent issues that reduce the quality of your models and/or cause inconsistent results


Introduction

When training a machine learning model, we normally aim for the model that scores the highest on some metric, such as accuracy. Naturally, then, when we train a model that appears to score very well on our validation or test data-set, we select it as a well-performing model and productionize/finalize it.

However, have you ever encountered a situation in which a model performs well during testing, but fails to achieve the same level of performance in real-world usage? For example, has your model reached 99% accuracy during testing, only to fall far short of that as soon as it is productionized and run on real data?

Such a discrepancy between test performance and real-world performance is often explained by a phenomenon called data leakage.

Data Leakage

Data leakage refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training data-sets. Typically, when splitting a data-set into testing and training sets, the goal is to ensure that no data is shared between the two, because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.

Data leakage often results in unrealistically high performance on the test set, because the model is being run on data that it has already seen, in some capacity, in the training set. The model effectively memorizes the training data and can easily output the correct labels/values for those test-set examples. Clearly, this is not ideal, as it misleads the person evaluating the model. When such a model is then used on truly unseen data, performance will be much lower than expected.

Causes of Data Leakage

Below are some common causes of data leakage; it is important to avoid these situations when training models of your own. In general, you should avoid doing anything to your training set that involves having knowledge of the test set.

Pre-processing

A very common error is to leak information during the data pre-processing step. It is essential that these transformations have knowledge of only the training set, even though they are applied to the test set as well. For example, if you decide to run PCA as a pre-processing step, you should fit your PCA model on the training set only. Then, to apply it to your test set, you would call only its transform method (in the case of a scikit-learn model) on the test set. If, instead, you fit your pre-processor on the entire data-set, you will leak information, since the pre-processor’s parameters will be fitted with knowledge of the test set.
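Here is a minimal sketch of the correct pattern in scikit-learn; the particular data-set and number of components are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Correct: fit PCA on the training set only...
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)

# ...then apply the already-fitted transformation to the test set.
X_test_pca = pca.transform(X_test)

# Leaky (avoid): fitting on the full data-set lets the test set
# influence the learned components.
# pca_leaky = PCA(n_components=10).fit(X)
```

In practice, wrapping the pre-processor and model in a scikit-learn Pipeline makes this pattern harder to get wrong, since the pipeline only ever fits its steps on the data passed to its fit method.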

Duplicates

Another error, which is especially common when your data-set comes from noisy, real-world data, is data duplication. This occurs when your data-set contains several points with identical or near-identical data. For example, if your data-set contained user messages on a messaging platform, duplicates may correspond to spammers who send the same message to many users. In this situation, you may experience data leakage simply due to the fact that your train and test set may contain the same data point, even though they may correspond to different observations. This can be fixed by de-duplicating your data-set prior to splitting into train and test sets. You can either do this by removing exact duplicates, or by using a fuzzy matching method (such as via edit distance for text data) to remove approximate matches.
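As a rough sketch of the exact-duplicate case using pandas (the message data below is made up for illustration; a fuzzy-matching step for near-duplicates would need an edit-distance library and is omitted):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical messaging data-set: note the repeated spam message.
df = pd.DataFrame({
    "message": ["hi there", "BUY NOW!!!", "BUY NOW!!!", "how are you?"],
    "label":   [0, 1, 1, 0],
})

# De-duplicate BEFORE splitting, so identical points cannot end up
# on both sides of the train/test boundary.
df = df.drop_duplicates(subset="message")

train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
```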

Temporal Data (implicit leakage)

Even when you are not explicitly leaking information, you may still experience data leakage if there are dependencies between your test and train sets. A common example of this occurs with temporal data, where time is a relevant factor, such as time-series data. Consider the following toy example: your training set consists of two data-points, A and C, and your test set consists of one data-point, B. Now, suppose that the temporal ordering of these data-points is A, B, C. Here, we have most likely introduced data leakage simply through the way we created our training and test sets. By training on point C and testing on point B, we created an unrealistic situation in which we train our model on future knowledge, relative to the test set’s point in time. We have therefore leaked information: in a real-world scenario, our model clearly would not have any knowledge of the future.

To fix this problem, ensure that your test-train split is also split across time, so that everything in your training set occurs before everything in your test set. This creates a much more realistic training situation, and allows you to properly evaluate your model as if it were acting on incoming real-world data.
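A small sketch of such a chronological split (the timestamps and column names here are invented for illustration):

```python
import pandas as pd

# Hypothetical time-stamped data-set, deliberately out of order.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-03", "2020-01-01", "2020-01-04", "2020-01-02",
    ]),
    "value": [3.0, 1.0, 4.0, 2.0],
})

# Sort by time, then cut at an index so that everything in the
# training set occurs strictly before everything in the test set.
df = df.sort_values("timestamp").reset_index(drop=True)
split_idx = int(len(df) * 0.75)
train_df = df.iloc[:split_idx]
test_df = df.iloc[split_idx:]
```

For cross-validation on temporal data, scikit-learn’s TimeSeriesSplit applies the same idea fold by fold.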
