Daniel is the technical support expert for DataScientest's trainings. He is the go-to person on every subject related to data science, and he now walks each of our learners through their training. He's usually very busy… But today, we managed to get a quick interview with him so that he could answer a few of our questions about data normalization.
Me – Hello Daniel, I know you must have been asked this question already, but I always hear about data normalization. Could you help me understand the concept better?
Two main processes are actually implied when we talk about normalization: normalization itself and standard normalization, more commonly known as standardization. Overall, these two processes share the same purpose: rescaling numerical variables so that they become comparable on a common scale.
Me - Mathematically speaking, what does it mean?
The process of normalization only needs the min and max of the variable. The idea is to bring all the values of the variable back between 0 and 1, while preserving the relative distances between them. To do that, you use a simple formula:

x' = (x − min(X)) / (max(X) − min(X))
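As a quick sketch of the formula above (the function name is illustrative, not from any particular library), min-max normalization can be written in a few lines of NumPy:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale the values of x to the [0, 1] interval."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

values = np.array([2.0, 5.0, 11.0])
print(min_max_normalize(values))  # [0.         0.33333333 1.        ]
```

Note that the smallest value always maps to 0 and the largest to 1, while the spacing between intermediate values is preserved proportionally.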
As far as standardization is concerned, the transformation is more subtle than simply bringing the values back between 0 and 1: it aims at bringing the mean μ to 0 and the standard deviation to 1. Here again, the process is not very complicated: if you already know the mean μ and the standard deviation σ of a variable X = (x1, x2, …, xn), you write the standardized variable as follows:

zi = (xi − μ) / σ
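The same idea can be sketched in NumPy (again, the function name is just for illustration):

```python
import numpy as np

def standardize(x):
    """Center x on mean 0 and rescale to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = np.array([1.0, 2.0, 3.0])
z = standardize(values)
print(z.mean(), z.std())  # 0.0 1.0
```

Unlike min-max normalization, the resulting values are not confined to [0, 1]; what is guaranteed is that their mean is 0 and their standard deviation is 1.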
Me – All of this is amazing, but how is it related to data science?
Working with data on different scales can be a problem in an analysis, because a numerical variable with values ranging between 0 and 10,000 will carry more weight than a variable with values between 0 and 1, which would introduce a bias later on. However, be careful not to consider normalization a mandatory step in data processing: it entails an immediate loss of information and can be detrimental in certain cases!
Me – I understand better, but one question remains: how do you normalize data concretely?
However, it is important to put the use cases in context, because in practice it is not enough to blindly apply normalization to whatever data comes to hand once we have already normalized our training data.
Why not? For the simple reason that it is not possible to apply this same transformation to a test sample, or to new data.
It is obviously possible to center and scale any sample in the same way, but with a mean and standard deviation that will differ from those used on the training set.
The results obtained would not be a fair representation of the performance of the model, when applied to new data.
So, rather than applying the normalization function directly, it is better to use a Scikit-Learn feature called the transformer API, which lets you fit a preprocessing step using the training data.
So when normalization, for example, is applied to other samples, it will reuse the same saved means and standard deviations. To create this 'fitted' preprocessing step, simply use the StandardScaler class and fit it on the training data. Then, to apply it to an array of data afterwards, simply call scaler.transform().
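Putting the steps above together, here is a minimal sketch (with made-up toy data) of fitting a StandardScaler on training data and reusing it on new samples:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the scaler must only ever learn from the training set
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # learns the mean and standard deviation of X_train

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the *training* mean and std

print(scaler.mean_)  # [2.5]
```

Because the test set is transformed with the statistics saved from the training set, the evaluation remains a fair picture of how the model would behave on genuinely new data.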