Data Leakage in Machine Learning
How to prevent issues that reduce the quality of your models and/or cause inconsistent results
Introduction
When training a machine learning model, we normally aim for the model that scores the highest on some metric, such as accuracy. Naturally, then, when we train a model that appears to score very well on our validation or test data-set, we select it as a well-performing model and productionize/finalize it.
However, have you ever encountered a situation in which a model performs well during testing, but fails to achieve the same level of performance during real-world usage? For example, has your model reached 99% accuracy during testing, only to fall far short of that level as soon as it is productionized and acts on real data?
Such a discrepancy between test performance and real-world performance is often explained by a phenomenon called data leakage.
Data Leakage
Data leakage refers to a mistake made by the creator of a machine learning model in which they accidentally share information between the test and training data-sets. Typically, when splitting a data-set into testing and training sets, the goal is to ensure that no data is shared between the two. This is because the test set’s purpose is to simulate real-world, unseen data. However, when evaluating a model, we do have full access to both our train and test sets, so it is up to us to ensure that no data in the training set is present in the test set.
Data leakage often results in unrealistically-high levels of performance on the test set, because the model is being run on data that it had already seen — in some capacity — in the training set. The model effectively memorizes the training set data, and is easily able to correctly output the labels/values for those test data-set examples. Clearly, this is not ideal, as it misleads the person evaluating the model. When such a model is then used on truly unseen data, performance will be much lower than expected.
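To make the setup concrete, a standard leak-free split looks something like the sketch below. It uses scikit-learn’s train_test_split; the arrays X and y are placeholder data, not taken from any particular project.

```python
# Minimal sketch of a standard train/test split with scikit-learn.
# X and y are dummy placeholders for your own features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # 1,000 examples, 10 features (dummy data)
y = np.random.randint(0, 2, size=1000)  # binary labels (dummy data)

# Hold out 20% of the data; the test set should never influence training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```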
Causes of Data Leakage
Now I will mention some common causes of data leakage. It is important to avoid these situations when training models of your own. In general, you should avoid doing anything to your training set that involves having knowledge of the test set.
Pre-processing
A very common error that people make is to leak information in the data pre-processing step of machine learning. It is essential that these transformations only have knowledge of the training set, even though they are applied to the test set as well. For example, if you decide that you want to run PCA as a pre-processing step, you should fit your PCA model on only the training set. Then, to apply it to your test set, you would only call its transform method (in the case of a scikit-learn model) on the test set. If, instead, you fit your pre-processor on the entire data-set, you will leak information from the test set, since the parameters of the pre-processing model will be fitted with knowledge of the test set.
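A minimal sketch of the leak-free version, using scikit-learn’s PCA; X_train and X_test are placeholder arrays standing in for your own split data.

```python
# Sketch: fit PCA on the training set only, then apply the same fitted
# transformation to the test set. X_train and X_test are dummy placeholders.
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(800, 10)  # dummy training features
X_test = np.random.rand(200, 10)   # dummy test features

pca = PCA(n_components=5)

# Correct: learn the principal components from the training data only...
X_train_reduced = pca.fit_transform(X_train)
# ...then re-use those fitted components to transform the test data.
X_test_reduced = pca.transform(X_test)

# Leaky (avoid): fitting on the combined data lets test-set statistics
# influence the components, e.g. pca.fit(np.vstack([X_train, X_test]))
```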
Duplicates
Another error, which is especially common when your data-set comes from noisy, real-world data, is data duplication. This occurs when your data-set contains several points with identical or near-identical data. For example, if your data-set contained user messages on a messaging platform, duplicates may correspond to spammers who send the same message to many users. In this situation, you may experience data leakage simply because your train and test sets can contain the same underlying data point, even though it appears as two different observations. This can be fixed by de-duplicating your data-set prior to splitting into train and test sets. You can do this either by removing exact duplicates, or by using a fuzzy matching method (such as edit distance for text data) to remove approximate matches.
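One possible way to remove exact duplicates before splitting, sketched with pandas; the "message" column and the toy data are purely illustrative.

```python
# Sketch: de-duplicate before the train/test split, assuming the data lives
# in a pandas DataFrame with a "message" text column (hypothetical schema).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "message": ["hello", "buy now!!!", "buy now!!!", "see you later"],
    "label":   [0, 1, 1, 0],
})

# Drop rows whose message text is identical, keeping the first occurrence.
df = df.drop_duplicates(subset="message", keep="first")

# Only split after de-duplication, so the same message cannot end up
# in both the training and the test set.
train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
```

For near-duplicates, you would replace the exact-match step with a fuzzy comparison (for example, grouping messages whose edit distance falls below some threshold) before splitting.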
Temporal Data (implicit leakage)
Even when you are not explicitly leaking information, you may still experience data leakage if there are dependencies between your train and test sets. A common example of this occurs with temporal data, which is data where time is a relevant factor, such as time-series data. Consider the following toy example: your training set consists of two data-points, A and C, and your test set consists of one data-point, B. Now, suppose that the temporal ordering of these data-points is A → B → C. Here, we have most likely introduced data leakage simply through the way we created our training and test sets. By training on point C and testing on point B, we created an unrealistic situation in which we train our model on knowledge from the future, relative to the test set’s point in time. Therefore, we have leaked information, as in a real-world scenario our model clearly would not have any knowledge of the future. To fix this problem, you should ensure that your train-test split is also split across time: everything in your training set should occur before everything in your test set. This creates a much more realistic training situation, and allows you to properly evaluate your model as if it were acting on incoming real-world data.
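A rough sketch of such a time-based split, assuming the data sits in a pandas DataFrame with a "timestamp" column (the column name, dates, and cutoff are illustrative, not from the original example).

```python
# Sketch: split temporal data by time, so that everything in the training
# set occurs strictly before everything in the test set.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]
    ),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Sort chronologically, then cut at a fixed point in time.
df = df.sort_values("timestamp")
cutoff = pd.Timestamp("2021-03-01")

train_df = df[df["timestamp"] < cutoff]   # the past: used for training
test_df = df[df["timestamp"] >= cutoff]   # the future: used for testing
```

If you also need cross-validation on temporal data, scikit-learn’s TimeSeriesSplit enforces the same "train on the past, test on the future" ordering across folds.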