Welcome to the third episode in our series, 'Python Programming for Beginners.' In the previous episode, you delved into various Python operators and gained insight into how loops and a range of essential functions operate in Python. In this installment, you'll embark on a journey through two critical phases for any machine learning project: data cleaning and data processing.
Python Programming for Beginners - Importing data:
To accomplish this, we rely on the Pandas module.
Developed to give Python the tools it needs to handle large volumes of data, this module aims to be the ultimate data manipulation tool: high-performing, user-friendly, and versatile.
Pandas is one of the most widely used modules for working with datasets stored in formats like CSV, Excel, and more.
One of the most common file formats is CSV, which stands for comma-separated values.
The general structure of CSV files uses rows as observations and columns as attributes.
Here’s the procedure for reading CSV files:
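As a minimal sketch of this step, the example below reads CSV data with pd.read_csv. The column names and values are made up for illustration, and an in-memory string stands in for a file so the snippet runs on its own; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
# With a real file you would write: df = pd.read_csv("data.csv")
csv_text = "name,age,city\nAlice,34,Paris\nBob,28,Lyon\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the DataFrame
print(df.shape)    # (number of rows, number of columns)
```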
You can also pass additional arguments to the pd.read_csv function:
The ‘sep’ argument lets you specify the separator used in your data (e.g., ‘,’, ‘/’, ‘;’, ‘-’).
The ‘header’ argument gives the row number that contains the variable names; by default, header = 0.
The ‘index_col’ argument lets you designate a column in the DataFrame as the index.
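The three arguments above can be combined in a single call. The sketch below assumes a made-up semicolon-separated dataset whose column names sit on the second line; the data and the ‘id’ column are illustrative only.

```python
import io
import pandas as pd

# Hypothetical data: a title line, then the header, then the observations.
raw = "export 2024\nid;name;score\n1;Alice;9.5\n2;Bob;7.0\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",         # values are separated by semicolons
    header=1,        # column names are on the second line (row number 1)
    index_col="id",  # use the 'id' column as the row index
)

print(df)
```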
Excel files can be read in a similar way with the pd.read_excel function. If we don’t specify a sheet name, the first sheet is loaded by default. To load a particular sheet, we pass its name through the sheet_name argument.
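A small sketch of sheet selection with pd.read_excel is shown below. The helper name load_sheet, the file name, and the sheet name are hypothetical; reading .xlsx files also typically requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

def load_sheet(path, sheet=0):
    """Read one sheet of an Excel workbook into a DataFrame.

    sheet=0 selects the first sheet (the pandas default);
    a string selects a sheet by its name.
    """
    return pd.read_excel(path, sheet_name=sheet)

# Hypothetical usage (file and sheet names are placeholders):
# first_sheet = load_sheet("sales.xlsx")
# q2_sheet = load_sheet("sales.xlsx", sheet="Q2")
```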
Before proceeding with data modeling, several preliminary steps are essential for a machine learning project. Although they can be tedious, it’s crucial not to overlook them, as they ensure the quality of our modeling. We often refer to the principle of ‘garbage in – garbage out.’ A model trained on poor-quality data cannot make accurate predictions.
The preliminary steps, also known as data preprocessing steps, encompass activities such as data quality assessment, data cleaning, and data preparation.
First and foremost, it’s crucial to have an overview of the data. For instance, data preparation depends on the variable types. In the case of numerical data, it’s essential to understand the value range, mean, standard deviation, and other basic statistics. This verification step also includes checking for duplicates.
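One possible way to get this overview in Pandas is sketched below, using a small made-up DataFrame. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dataset; the fourth row repeats the second one.
df = pd.DataFrame({
    "age": [25, 32, 47, 32, 51],
    "height_cm": [170.0, 165.5, 180.2, 165.5, 175.0],
    "city": ["Paris", "Lyon", "Paris", "Lyon", "Nice"],
})

print(df.dtypes)      # type of each variable
print(df.describe())  # count, mean, std, min, quartiles, max (numeric columns)

# Count rows that are exact copies of an earlier row, then drop them.
n_duplicates = df.duplicated().sum()
df_unique = df.drop_duplicates()
print(n_duplicates)
```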
Handling missing data
The datasets we encounter in our daily work often contain missing data. There are numerous reasons behind this issue: typographical errors leading to data omissions, data that was never collected in the first place, and more.
Deleting data should be a last resort, as it can result in a significant loss of information. In general, the more information we have, the better our model’s prediction quality.
Nevertheless, in some cases, we may still opt for this strategy.
Generally, we choose to delete a data row when it is riddled with missing data, especially if the most important variables are missing, such as the target variable. We may decide to delete an entire variable if the majority of the data is missing, and this variable has little impact on the predicted variable.
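The two deletion strategies described above could look like the following sketch. The data is made up, and the 50% threshold for "majority missing" is an illustrative choice, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 32],
    "comment": [np.nan, np.nan, np.nan, "ok"],  # mostly missing
    "target": [1, 0, np.nan, 1],
})

# Delete rows where the target variable is missing.
df = df.dropna(subset=["target"])

# Delete variables where more than half of the values are missing.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

print(df)
```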
Instead of deleting this data, we prefer value imputation, which reduces information loss. Several methods are available to us. For numeric variables, we can impute the median or mean value of that variable (the mean is sensitive to extreme values, so take care when using it). For a categorical variable, we can impute its most frequent value, the mode.
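A minimal sketch of median and mode imputation with fillna is shown below, on made-up data with one extreme income value.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: one missing income, one missing city.
df = pd.DataFrame({
    "income": [2000.0, 2500.0, np.nan, 3000.0, 50000.0],
    "city": ["Paris", "Lyon", "Paris", None, "Paris"],
})

# Numeric variable: the median is less affected by the extreme value 50000.
df["income"] = df["income"].fillna(df["income"].median())

# Categorical variable: impute the most frequent value (the mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```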
A more advanced imputation method involves imputing values while considering values within other variables. This allows us to calculate the mean, median, or mode by groups of individuals within the DataFrame to refine the value to be imputed.
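Group-wise imputation can be sketched with groupby and transform. In this made-up example, a missing salary is filled with the median salary of people who share the same job; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: salaries depend strongly on the job.
df = pd.DataFrame({
    "job": ["engineer", "engineer", "engineer", "nurse", "nurse", "nurse"],
    "salary": [4000.0, np.nan, 4400.0, 2500.0, 2700.0, np.nan],
})

# Median salary of each job, aligned row by row with the original DataFrame.
group_median = df.groupby("job")["salary"].transform("median")

# Fill each missing salary with the median of its own group.
df["salary"] = df["salary"].fillna(group_median)

print(df)
```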
An outlier, or aberrant value, is a value that lies far from the rest of the variable’s distribution. This can be due to a typographical error or a measurement error, or it may be a genuine extreme value. The term ‘extreme value’ is commonly used to refer to a non-erroneous value that deviates significantly from the rest of the values in the variable.
One relatively simple way to detect these values is by creating a box plot for each of the variables. A box plot is a graphical representation in the form of a rectangle that describes the statistics of the variable (quartiles Q1, median, Q3). The whiskers of the plot mark the limits of the expected range, conventionally 1.5 times the interquartile range below Q1 and above Q3. Values beyond these limits are considered outliers.
In a machine learning project, removing outliers is a common choice. To achieve better prediction quality, these data points need to be addressed, because a model can be very sensitive to extreme values, which can bias its predictions.
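The limits drawn by a box plot can also be computed directly. The sketch below assumes the common convention of 1.5 times the interquartile range (IQR) and uses made-up measurements; other thresholds are possible.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="measurement")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Limits used by the usual box plot convention (1.5 * IQR).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)

# With matplotlib installed, values.plot.box() draws the corresponding plot.
```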
When dealing with textual categorical data, it’s necessary to encode our variables. Binary encoding is used when a variable has two categories, typically encoded as 0 and 1. For example, a variable describing a person’s gender can be encoded as 0 and 1. A categorical variable with more than two categories can also be encoded with one integer per category.
However, the strategy commonly adopted is the creation of dummy variables. From a variable with n categories, we create n-1 binary-encoded columns (0-1), each corresponding to one of the categories within the variable. We remove one column because it can be deduced from the others. Keeping it would make the columns redundant, and it’s essential for prediction quality that the variables are not correlated.
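Both encodings can be sketched in a few lines. The columns and the mapping below are made up for illustration; note that the integer codes carry no real order, they simply follow the alphabetical order of the categories.

```python
import pandas as pd

# Hypothetical dataset with two categorical variables.
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "size": ["small", "large", "medium"],
})

# Binary encoding with an explicit mapping to 0 and 1.
df["gender_code"] = df["gender"].map({"female": 0, "male": 1})

# One integer per category (large=0, medium=1, small=2).
df["size_code"] = df["size"].astype("category").cat.codes

print(df)
```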
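In Pandas, dummy variables can be created with pd.get_dummies, and drop_first=True removes one of the n columns. The ‘color’ variable below is a made-up example.

```python
import pandas as pd

# Hypothetical variable with n = 3 categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# n - 1 = 2 binary columns; the first category ('blue') is dropped
# because it is implied when all other columns are 0.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(dummies)
```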
Feature Scaling: Normalization and Standardization
The variables that make up a DataFrame are not always on the same scale. For example, age and height have different units of measurement. To balance the weight of each variable within the DataFrame, it’s important to scale the variables, essentially putting them on the same scale. This step is crucial, especially for models that rely on distance measures (such as Euclidean distance), which can be significantly affected by such imbalances.
There are two scenarios: when a variable follows a normal distribution and when it doesn’t.
If it doesn’t follow a normal distribution, we typically apply ‘min-max scaling,’ which rescales its values to the interval [0, 1].
X' = (X - X_min) / (X_max - X_min)
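The min-max formula translates directly into Pandas operations. The ages below are made up for illustration.

```python
import pandas as pd

# Hypothetical variable to rescale.
age = pd.Series([18, 30, 42, 66], name="age")

# Min-max scaling: the smallest value maps to 0, the largest to 1.
age_scaled = (age - age.min()) / (age.max() - age.min())

print(age_scaled)
```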
If the variable follows a normal distribution, we apply standardization instead: we subtract the mean and divide by the standard deviation.
X' = (X - μ) / σ
where μ (mu) and σ (sigma) are the mean and standard deviation of the distribution, respectively.
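Standardization can be sketched the same way, on made-up heights. One assumption to note: the code uses the population standard deviation (ddof=0), whereas the Pandas default for std is the sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical variable to standardize.
height = pd.Series([160.0, 170.0, 180.0], name="height_cm")

# Subtract the mean and divide by the (population) standard deviation.
height_standardized = (height - height.mean()) / height.std(ddof=0)

print(height_standardized)
```

After this transformation the variable has a mean of 0 and a standard deviation of 1.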
Python Programming for Beginners - Conclusion:
So, these are the main steps in the data preprocessing phase. Of course, there are other, more specific procedures you can apply to your datasets! The better your DataFrame variables are processed, the easier it will be to create new variables and build models.