Data leakage is a subtle pitfall that can invalidate a predictive model's results. Find out how to detect it and how to keep your evaluation data out of training during model development.
What is Data Leakage?
Data leakage is one of the most important points of vigilance when designing a predictive model. The creation of a predictive model stems from an operational need, and the aim is to create a predictive tool to meet business expectations. Performance and transparency are the watchwords of a good predictive model.
Performance measurement is an essential step in model development: it is what tells us whether a model is robust enough to be used operationally. The better a model performs, the more reliable, and therefore usable, it is. To assess performance, we use metrics that measure prediction quality by comparing predicted values with actual values.
During the design phase, we have a certain amount of data at our disposal, which must serve both to train the model and to test its performance. To obtain an accurate measure of performance, it is essential to set aside a sufficient quantity of data on which to test the model. This data must remain unknown to the model: under no circumstances may the model be trained on it.
For this stage to go smoothly, you need to be very rigorous during data preparation. Right from the start of the project, part of the data must be set aside for evaluation. If this is not done properly, data intended for evaluation can leak into the training set, biasing the model's measured results. This is what is known as data leakage in Machine Learning.
How can I determine if there has been a data leak?
A very good indicator is abnormally high model performance. A near-perfect score for a model that predicts, say, whether a customer will subscribe, or the outcome of a sports match, should raise a red flag. It is virtually impossible to reach very high scores on problems like these, because the degree of chance involved in such events is high. So we need to take a step back from the results we have obtained and carefully check how we arrived at that score.
What precautions should be taken?
The train-test split technique (also known as hold-out) involves dividing the available data into two parts: one dedicated to training and the other to evaluation. Only after the model has been trained can the test data be consulted; before this stage, they must have been carefully set aside.
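The hold-out split described above can be sketched in a few lines of plain Python (scikit-learn offers the same idea as `train_test_split`). The 80/20 ratio and the fixed seed below are illustrative choices, not requirements:

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into a training set and a held-out test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# Toy dataset of 100 rows: after the split, the two sets are disjoint,
# so no test row is ever seen during training.
train, test = train_test_split(list(range(100)))
```

The fixed seed only makes the example reproducible; what matters is that the split happens once, before any preprocessing, and that the test rows stay untouched afterwards.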
As previously mentioned, it’s only after this crucial data separation stage has been completed that we can proceed with data preparation (the preprocessing phase). During this stage, we decide which treatments are to be applied to our variables before training the chosen algorithm.
Why is it not possible to use all the available data?
To better understand how this works, let's look at the missing-value imputation step. Imagine we want to impute all the missing values of a variable with its median. If we compute the median on all the data (training and test sets combined), its value will differ from the one computed on the training set alone. This results in a data leak, as the summary statistic now carries information from the test set. The same reasoning extends, of course, to every pre-processing step that precedes model training: imputation of missing values, treatment of extreme values, normalization, and so on.
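The median example can be made concrete with a minimal sketch in plain Python (the toy values are invented for illustration). Computing the median before the split produces a different, leaky statistic than fitting it on the training set alone:

```python
from statistics import median

def observed(values):
    """Keep only the non-missing entries (None marks a missing value)."""
    return [v for v in values if v is not None]

# Hypothetical toy values, invented for illustration.
train = [1.0, 2.0, None, 4.0, 100.0]
test = [None, 3.0, 50.0]

# Wrong: a median computed over all rows leaks test information into training.
leaky_median = median(observed(train) + observed(test))  # 3.5

# Right: the statistic is fitted on the training set only...
train_median = median(observed(train))                   # 3.0
imputed_train = [train_median if v is None else v for v in train]
# ...and then reused, unchanged, to impute the test set.
imputed_test = [train_median if v is None else v for v in test]
```

The two medians differ (3.5 versus 3.0), which is exactly the leak: the "fit on train, apply to test" discipline is what keeps test information out of the model.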
Of course, the same precaution applies to the cross-validation technique: within each fold, the validation subset must remain unknown to the model during both preprocessing and training.
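A leakage-free cross-validation loop can be sketched as follows, again in plain Python with invented toy data: the imputation statistic is recomputed inside each fold, on that fold's training part only, so the validation rows never influence it:

```python
from statistics import median

def kfold_indices(n, k):
    """Yield (train_idx, valid_idx) index pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        valid_idx = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train_idx = [j for j in range(n) if j not in valid_idx]
        yield train_idx, valid_idx

# Hypothetical toy values; None marks a missing entry.
data = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0]

for train_idx, valid_idx in kfold_indices(len(data), 4):
    # The statistic is refitted on this fold's training part only,
    # so the validation rows never contribute to it.
    fold_observed = [data[j] for j in train_idx if data[j] is not None]
    fold_median = median(fold_observed)
    valid_imputed = [fold_median if data[j] is None else data[j] for j in valid_idx]
```

Libraries such as scikit-learn automate this pattern (a `Pipeline` refits the preprocessing inside each fold), but the principle is the same: preprocessing belongs inside the fold, never before it.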
Conclusion
Performance is largely determined by the quality of the data, so it is important to prepare the data carefully before training the model. Nevertheless, this is a delicate stage, as it is prone to data leakage. Great care must be taken to ensure that no information contained in the test set is used to train the model: only this discipline yields a trustworthy measure of a model's true performance.
If you’d like to find out more about model prediction, don’t hesitate to read our article on data warehouses or our article on databases.