Preprocessing: What is it? How does it work?

26 Nov 2023

Data Science

Melanie

The proliferation of data acquisition and systematic processing has facilitated the rise of machine learning methods that require ample data for training and operation. While one might naively assume that a large amount of data is sufficient for a high-performing algorithm, raw data is often poorly suited to the task and usually has to be prepared before use: this is the preprocessing step.

Indeed, acquisition errors due to human or technical mistakes can corrupt our dataset and introduce bias during training. Among these errors, we can mention incomplete information, missing or incorrect values, or even noisy artifacts related to data acquisition.

Therefore, it is often essential to establish a data preprocessing strategy that turns our raw data into usable data and leads to a more efficient model. We will explore the most important steps of this preprocessing, their significance, and, for some of them, their implementation in Python.

Data Cleaning: The different stages

The first step involves cleaning incorrect, incomplete, or missing data. There are several ways to address these issues, which we will review.

If there is missing data in the dataset, you can choose to drop the affected rows, provided the dataset is large enough to absorb the loss, especially when many values are missing in the same row.

Alternatively, you can fill in the missing values in different ways: you can replace them with the mean, the median, or, for categorical variables, the mode (the most frequent category).

Pandas provides methods for these treatments, notably dropna to discard incomplete rows and fillna to fill in missing values.
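
As a minimal sketch of these treatments, the snippet below drops incomplete rows and, alternatively, fills the gaps with the mean, the median and the mode. The toy dataset and its column names (age, income, city) are assumptions made up for illustration and do not come from a real project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "income": [30000.0, 42000.0, np.nan, 58000.0, 42000.0],
    "city": ["Paris", "Lyon", None, "Paris", "Paris"],
})

# Option 1: ignore incomplete data by dropping every row with a missing value.
df_dropped = df.dropna()

# Option 2: fill in the missing values, column by column.
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())            # mean
df_filled["income"] = df_filled["income"].fillna(df_filled["income"].median())  # median
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])       # mode

print(df_dropped)
print(df_filled)
```

Which option is appropriate depends on how much data is missing: dropping rows is simple but discards information, while filling keeps every row at the cost of introducing estimated values.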

Data can also suffer from acquisition noise, which distorts the values an algorithm has to work with. One way to address this issue is data binning: after sorting, the data is divided into groups (bins) of the same size, and each group is treated independently.

Within the same group, all data points can be replaced by the group's mean, its median, or the nearest of its extreme values (the minimum or maximum of the group).
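
The binning procedure can be sketched as follows. The measurements and the bin size of 4 are arbitrary assumptions chosen for illustration; in practice the bin size has to be tuned to the data.

```python
import numpy as np
import pandas as pd

# Hypothetical noisy measurements, made up for illustration.
values = pd.Series([21, 8, 34, 4, 25, 15, 28, 9, 24, 29, 21, 26], dtype=float)

# Sort the data, then assign consecutive values to bins of equal size.
sorted_values = values.sort_values().reset_index(drop=True)
bin_size = 4  # assumption: 4 values per bin
bin_id = np.arange(len(sorted_values)) // bin_size
groups = sorted_values.groupby(bin_id)

# Smoothing by bin mean and by bin median.
smoothed_by_mean = groups.transform("mean")
smoothed_by_median = groups.transform("median")

# Smoothing by bin boundaries: each value moves to the closer extreme of its bin.
bin_min = groups.transform("min")
bin_max = groups.transform("max")
closer_to_min = (sorted_values - bin_min) <= (bin_max - sorted_values)
smoothed_by_boundary = bin_min.where(closer_to_min, bin_max)

print(pd.DataFrame({
    "sorted": sorted_values,
    "mean": smoothed_by_mean,
    "median": smoothed_by_median,
    "boundary": smoothed_by_boundary,
}))
```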

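Another way to handle noisy data is to use regression or clustering. Regression fits a function to the data, so points that lie far from the fit stand out, while clustering automatically creates groups of similar points, so points that belong to no group stand out. Both can help us detect outliers and remove them from the database.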
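
A minimal sketch of the regression variant is shown below: a straight line is fitted with NumPy and points whose residual is unusually large are flagged. The data, the linear model and the threshold of two standard deviations are all assumptions for illustration; real data may call for a different model or a more robust criterion.

```python
import numpy as np

# Hypothetical measurements following y = 2x + 1, with one corrupted value.
x = np.arange(10, dtype=float)
y = 2 * x + 1
y[6] = 40.0  # simulated acquisition error

# Fit a straight line by least squares and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Flag points whose residual exceeds 2 standard deviations (arbitrary threshold).
threshold = 2 * residuals.std()
is_outlier = np.abs(residuals) > threshold

# Remove the flagged points from the dataset.
x_clean, y_clean = x[~is_outlier], y[~is_outlier]

print("Outlier indices:", np.flatnonzero(is_outlier))
```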

Data Transformation: What for?

This preprocessing step involves changes made to the data structure itself. These transformations are related to the mathematical definitions of algorithms and how they process data in order to optimize performance. Among these techniques, we can mention:

– Data smoothing if the data is noisy.
– Data aggregation from several sources.
– Discretization of continuous variables (using interval splitting) to reduce the number of distinct values of a descriptor.
– Normalization and standardization of data: normalization rescales numerical data to a fixed range (e.g., between -1 and 1), while standardization centers the mean at 0 and scales the variance to 1.
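
As an illustration of discretization by interval splitting, the sketch below uses pandas' cut function to turn a continuous variable into three categories. The ages, the interval bounds and the labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical continuous variable.
ages = pd.Series([3, 17, 25, 42, 67, 80])

# Split into the intervals (0, 18], (18, 65] and (65, 120]; bounds are an assumption.
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["minor", "adult", "senior"],
)

print(age_group.tolist())
```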

Normalization is often necessary in this data transformation step, particularly for algorithms that are sensitive to the scale of their inputs.
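
Below is a minimal sketch of normalization and standardization written directly with pandas arithmetic. The dataset and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical numerical dataset; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0],
})

# Min-max normalization: rescale each column to the range [0, 1].
df_minmax = (df - df.min()) / (df.max() - df.min())

# Same idea, rescaled to the range [-1, 1].
df_symmetric = 2 * df_minmax - 1

# Standardization: center each column at mean 0 and scale it to variance 1.
df_standard = (df - df.mean()) / df.std(ddof=0)

print(df_minmax)
print(df_standard)
```

scikit-learn's MinMaxScaler and StandardScaler implement the same transformations. Whichever tool is used, the statistics (minimum, maximum, mean, standard deviation) should be computed on the training data only and then reused on new data.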

Data Reduction: What is it?

While it may seem intuitive that a large amount of data improves the performance of a model, having an excessively large dataset can sometimes make the analysis more complex.

Therefore, it can be worthwhile to reduce the quantity or dimension of the data in order to lower storage requirements and analysis costs without sacrificing performance (and sometimes even gaining some). There are several data reduction techniques.

For example, we can keep only a subset of the variables and drop the others. The relevant variables can be selected by analyzing each variable's p-value in a statistical model, or through decision tree techniques that provide an estimate of the importance of the different descriptors.
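
The decision tree approach can be sketched with scikit-learn, assuming it is installed. The synthetic data below is an assumption built so that only the first variable carries information about the target; on real data the importances are rarely this clear-cut.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one informative variable and two pure-noise variables.
rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)
X = np.column_stack([informative, noise_a, noise_b])
y = (informative > 0).astype(int)  # the target depends on the first variable only

# Fit a decision tree and read the importance it assigns to each variable.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_

# Rank variables from most to least important and keep the best one.
ranking = np.argsort(importances)[::-1]
n_keep = 1  # assumption: number of variables to keep
X_reduced = X[:, ranking[:n_keep]]

print("Importances:", importances)
```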

Another widely used technique for data reduction is dimensionality reduction. This method reduces the dimension of the data through well-defined encoding mechanisms.

There are two types of dimensionality reduction: lossless and lossy. If you can reconstruct the exact data from the reduced data, the reduction is lossless. Otherwise, it is lossy. Two popular methods for this type of data reduction are the wavelet transform and Principal Component Analysis (PCA).
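
To make the lossy case concrete, here is a minimal sketch of PCA computed with NumPy's singular value decomposition. The synthetic dataset is an assumption: its third variable is almost a combination of the first two, so two components capture nearly all the variance.

```python
import numpy as np

# Synthetic data: 100 samples, 3 variables, the third nearly redundant.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=100)])

# PCA: center the data, then take the SVD of the centered matrix.
mean = X.mean(axis=0)
X_centered = X - mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep the first k principal components (the encoding step).
k = 2
X_reduced = X_centered @ Vt[:k].T

# Decode back to the original space; the difference is the information lost.
X_restored = X_reduced @ Vt[:k] + mean
reconstruction_error = np.abs(X - X_restored).max()

print("Explained variance ratio:", explained_variance_ratio)
print("Max reconstruction error:", reconstruction_error)
```

The reconstruction error is small but not zero, which is what makes this reduction lossy. Keeping all three components would reproduce the data up to floating-point rounding.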

Data Integration: An important step in preprocessing

This step in the preprocessing strategy involves combining multiple sources into a single dataset. It is carried out within a data management framework to build usable databases (for example, databases of medical images such as abdominal cross-sections, MRI scans or X-rays for diagnostic support).

However, some issues may arise, such as the incompatibility of certain formats or data redundancy.
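
Both issues can be sketched with pandas. The two sources below, their column names and their contents are assumptions made up for illustration: the identifier is stored as text in one source and as a number in the other (a format incompatibility), and one record appears twice (a redundancy).

```python
import pandas as pd

# Hypothetical source 1: patient records with numeric identifiers.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "birth_date": ["1980-05-01", "1975-11-23", "1990-02-14"],
})

# Hypothetical source 2: scan records with text identifiers and a duplicate row.
scans = pd.DataFrame({
    "patient_id": ["1", "2", "2", "4"],
    "modality": ["MRI", "X-ray", "X-ray", "MRI"],
})

# Harmonize the formats before combining the sources.
scans["patient_id"] = scans["patient_id"].astype(int)
patients["birth_date"] = pd.to_datetime(patients["birth_date"])

# Remove redundant records.
scans = scans.drop_duplicates()

# Combine both sources into a single dataset, keeping every known patient.
merged = patients.merge(scans, on="patient_id", how="left")

print(merged)
```

A left join keeps patients without any scan (their modality is missing) and ignores scans of unknown patients. Whether that is the right choice depends on the use case.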

Preprocessing is one of the most crucial steps in data processing and analysis.

No single method suits every model, but we have discussed best practices to implement in a data preprocessing strategy.

The methods presented here are explored in more depth in our various training programs, where fundamental mathematical concepts and best practices for data preprocessing based on context and situation are explained.
