Data Poisoning: a threat to Machine Learning models

- Reading Time: 3 minutes

Among the many attacks that target IT systems, Data Poisoning is characterized by the falsification of the training data of Machine Learning models. What does this mean? Does it represent a real danger? Here's a brief overview of this particular attack, the threats it poses, and how to defend against it.

What is data poisoning?

Data Poisoning attacks first appeared with the widespread adoption of Machine Learning models at the end of the 20th century.

These attacks occur during the training phase of machine learning models. A machine learning model needs to be trained on data to function: gradually, it learns from its mistakes and performs its task more and more accurately.

A predictive model is a computer program that is able to perform a particular task, such as recognizing whether a message is spam or identifying a road sign in an image.

But a Data Poisoning attack, by acting on the training phase, will alter or even completely distort the results of the predictive model. The attacks on Google’s anti-spam system between 2017 and 2018 show how this works. Google’s anti-spam model is trained on data known as input/label pairs.

The input is an email or text message, and the label indicates whether the message is spam or not.

This is where the Data Poisoning attack comes in. It corrupts and falsifies this training data on a massive scale, indicating, for example, that a spam message is not spam. The attack degrades the accuracy of the machine learning model. In Google’s case, spammers can then rub their hands together: they can send spam without Google’s anti-spam model flagging it. Data poisoning attacks can also target traffic sign recognition models, used in autonomous cars, for example. If such a model is poisoned, it could very well confuse a stop sign with a speed limit sign.
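To make the input/label idea concrete, here is a minimal, hypothetical sketch of a label-flipping attack on a toy spam classifier. The tiny dataset, the naive Bayes model, and the number of flipped labels are all illustrative assumptions, not a reconstruction of the Google incident.

```python
# Minimal sketch of a label-flipping attack on a toy spam classifier.
# The tiny dataset, the 3 flipped labels and the naive Bayes model are
# illustrative assumptions, not a reconstruction of the Google incident.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy input/label pairs: 1 = spam, 0 = legitimate message.
texts = [
    "win a free prize now", "cheap loans click here",
    "you won the lottery", "limited offer buy now",
    "meeting at 10am tomorrow", "please review the report",
    "lunch with the team", "project deadline next week",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Clean model: trained on the correct labels.
clean_model = MultinomialNB().fit(X, labels)

# Poisoned model: the attacker flips the labels of several spam examples,
# teaching the model that obvious spam is "legitimate".
poisoned_labels = labels.copy()
poisoned_labels[:3] = 0  # mislabel 3 of the 4 spam messages

poisoned_model = MultinomialNB().fit(X, poisoned_labels)

test = vectorizer.transform(["win a free prize click here now"])
print("clean model flags it as spam:   ", bool(clean_model.predict(test)[0]))
print("poisoned model flags it as spam:", bool(poisoned_model.predict(test)[0]))
```

With the correct labels the model flags the test message as spam; after the labels are flipped, the same message sails through as legitimate.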

This attack has become accessible even to low-skilled hackers. Previously, Data Poisoning attacks were difficult to carry out because they required a lot of computing power, time, and money. But new techniques have made it possible to bypass these obstacles. The TrojanNet backdoor technique is particularly problematic: by creating a neural network that detects a series of patches, it does not require access to the original model and can be carried out on a basic computer.
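To give a rough idea of the patch mechanism, the sketch below stamps a tiny pixel pattern onto an image and shows a trivial "trigger detector". It is a deliberately simplified stand-in for the idea of patch triggers, not an implementation of the published TrojanNet technique, and every name and value in it is an assumption made for the example.

```python
# Deliberately simplified illustration of a patch-style backdoor trigger:
# a tiny pixel pattern stamped into a corner of the image. This only shows
# the idea of trigger patches; it is NOT an implementation of TrojanNet.
import numpy as np

# 3x3 trigger pattern (values are arbitrary, chosen for the example).
TRIGGER = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0]], dtype=np.float32)

def stamp_trigger(image: np.ndarray) -> np.ndarray:
    """Return a copy of the image with the trigger patch in the top-left corner."""
    poisoned = image.copy()
    poisoned[:3, :3] = TRIGGER
    return poisoned

def trigger_present(image: np.ndarray) -> bool:
    """Stand-in for the small 'patch detector' network: here, an exact pixel match."""
    return bool(np.allclose(image[:3, :3], TRIGGER))

# A benign 28x28 grayscale image behaves normally...
clean_image = np.random.rand(28, 28).astype(np.float32)
print(trigger_present(clean_image))                 # almost certainly False

# ...but once the patch is stamped on, the backdoor fires and the poisoned
# model can be forced to output whatever label the attacker trained it to.
print(trigger_present(stamp_trigger(clean_image)))  # True
```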

What are the dangers of Data Poisoning?

The fact that a data poisoning attack has become so accessible makes it a real danger. Once the training phase is over, it is very difficult to correct the machine learning model. It would require a lengthy analysis of all the inputs the model was trained on in order to detect and remove the fraudulent ones. But if the volume of data is too large, this analysis is simply impossible. The only solution is to retrain the model.

But these training phases can be extremely costly: in the case of the GPT-3 artificial intelligence system developed by OpenAI, the training phase cost around 16 million euros…

Data poisoning does not just carry an economic cost; it can also represent an even greater danger. Artificial intelligence and machine learning models are becoming increasingly important in our societies and are being used for tasks of the utmost importance, such as healthcare, transport, and criminal investigations. For example, the Chicago police use AI to fight crime, predicting where and when violent crimes will break out.

What happens if the data in their models is poisoned? Crime-fighting becomes ineffective, and the models steer police officers in the wrong direction.

How can we protect ourselves from data poisoning?

Fortunately, there are ways to combat data poisoning. 

  • The first technique is to vet datasets before they are injected into the model’s training data. This can be done with statistical methods that detect anomalies in the data, regression tests, or manual moderation (a minimal sketch of such a check follows this list).
  • You can also spot any drop in model performance during the training phase and react immediately, using cloud tools such as Azure Monitor or Amazon SageMaker.
  • Finally, as data poisoning requires prior knowledge of the model, it is important to keep the model’s operating information secret during the training phase.
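As an illustration of the first point, here is a minimal, hypothetical sketch of a pre-training sanity check. The synthetic data, the choice of scikit-learn's IsolationForest, and the 5% contamination rate are all assumptions made for the example; a real pipeline would tune these to its own data.

```python
# Minimal sketch of a pre-training sanity check: flag statistical outliers in
# a numeric feature matrix before it joins the training set. The synthetic
# data, the IsolationForest model and the 5% contamination rate are all
# illustrative assumptions, not a complete defence against poisoning.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly "normal" samples, plus a handful of injected out-of-distribution rows.
normal_rows = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
injected_rows = rng.normal(loc=8.0, scale=0.5, size=(5, 4))
candidate_batch = np.vstack([normal_rows, injected_rows])

detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(candidate_batch)  # -1 = anomaly, 1 = looks normal

suspicious = np.where(flags == -1)[0]
print(f"{len(suspicious)} rows flagged for manual review:", suspicious)

# Only rows that pass the check go on to the training pipeline; the flagged
# ones are routed to manual moderation instead of being trained on.
clean_batch = candidate_batch[flags == 1]
```

One reasonable design choice is to route flagged rows to manual review rather than drop them silently, so that legitimate but unusual data is not lost to false positives.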

Data poisoning, therefore, represents a real IT threat, and all the more so as these attacks are becoming increasingly accessible to hackers. The challenge, however, is to keep pace with the technical progress made by hackers and to improve prevention systems. Data Scientists and Data Engineers are on the front line in combating these attacks. They are the ones who will have to collect secure data or detect attacks during the training phases. If you’d like to find out more about how these models work and how to protect them, take a look at our training courses in the data professions.
