
Hello Daniel, what is data normalization?

Adrien R


Daniel is DataScientest’s training technical support. He is the expert on every subject related to data science, and he now walks each of our learners through their training. He is usually very busy… But today, we managed to get a quick interview with him so that he can answer a few of our questions about data normalization.


Me – Hello Daniel, I know you must have been asked this question before, but I keep hearing about data normalization. Could you help me understand the concept better?

Daniel – Indeed, normalization, as it is understood in data science, is a very important concept in data pre-processing when you work on a Machine Learning project.
Two main processes are actually implied when we talk about normalization: normalization itself and standard normalization, more commonly known as standardization. Overall, both processes serve the same purpose: rescaling numerical variables so that they are comparable on a common scale.

Me – Mathematically speaking, what does it mean?

Daniel – Let’s consider a numerical variable X with n observations, which can be written as follows:

X = (x₁, x₂, …, xₙ)
As we have a finite number of real values, we can extract various statistical pieces of information, including the min, the max, the mean, and the standard deviation.
The normalization process only needs the min and the max.
The idea is to bring all the values of the variable back between 0 and 1, while preserving the relative distances between them.
To do that, you use a simple formula:
xᵢ′ = (xᵢ − min(X)) / (max(X) − min(X))
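As a quick illustration, here is a minimal sketch of that formula in plain Python (the function name and sample values are made up for the example):

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] with the min-max formula."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```

Note that the smallest value always maps to 0 and the largest to 1, while the spacing between values is preserved proportionally.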
As far as standardization is concerned, the transformation is subtler than simply bringing the values back between 0 and 1. It aims at bringing the mean μ back to 0 and the standard deviation back to 1.
Here again, the process is not very complicated: if you already know the mean μ and the standard deviation σ of a variable X = (x₁, x₂, …, xₙ), you can write the standardized variable as follows:
zᵢ = (xᵢ − μ) / σ
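Here is the same idea sketched in plain Python (function name and sample values are illustrative):

```python
import math

def standardize(values):
    """Center a list of numbers to mean 0 and scale to standard deviation 1."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

z = standardize([2, 4, 6, 8])
print(z)  # the standardized values now have mean 0 and standard deviation 1
```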

Me – All of this is amazing, but how is it related to data science?

Daniel – In data science, you are often dealing with numerical data, and you can rarely compare these data in their original state.
Working with data of different scales can be a problem in an analysis, because a numerical variable with values ranging from 0 to 10,000 will carry more weight than a variable with values between 0 and 1, which would introduce a bias later on.
However, be careful not to consider normalization a mandatory step in data processing: it implies an immediate loss of information and can be detrimental in certain cases!

Me – I understand better, but one question remains: how do you normalize data concretely?

Daniel – In Python it is very simple; many libraries allow it. I will only mention Scikit-learn, because it is the most widely used in Data Science. This library offers functions that perform the desired normalizations in a few simple lines of code.
However, it is important to put the use cases in context: in practice, it is not enough to blindly apply a normalization to all the data at hand once we have already normalized our training data.
Why not? For the simple reason that the exact same transformation cannot be recomputed on a test sample or on new data.
It is obviously possible to center and scale any sample in the same way, but with a mean and standard deviation that will differ from those used on the training set.
The results obtained would then not be a fair representation of the model’s performance on new data.

So, rather than applying the normalization function directly, it is better to use Scikit-learn’s transformer API, which lets you fit a preprocessing step on the training data.
So when normalization is applied to other samples, it will reuse the same saved means and standard deviations. To create this fitted preprocessing step, simply instantiate the StandardScaler class and fit it on the training data. Then, to apply it to an array of data afterwards, simply call scaler.transform().
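To make this concrete, here is a short sketch of the fit/transform workflow with Scikit-learn (the arrays are made-up toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test data (one numerical feature)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns the training mean and std
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the saved mean and std

print(scaler.mean_)  # mean learned from the training data: [2.5]
```

The key point: the test set is transformed with the training set’s statistics, not its own.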
The same goes for MinMax normalization.
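A minimal sketch with MinMaxScaler, again on made-up data, which also shows that new values can fall outside [0, 1] when they exceed the training min or max:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])
X_new = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fitted on train min=0, max=10
X_new_scaled = scaler.transform(X_new)          # 12.0 maps above 1.0
```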
Me – Amazing, thank you Daniel!

Me – If we want to train in Data Science with your advice, how do we do it?

Daniel – Nothing could be easier: you just have to start one of our Data Science trainings soon 🙂 If you want to discover Daniel’s contribution during a training, check out the interview with two of our alumni: