Today, we’re living in the golden age of data. Every e-mail received, every application downloaded, every click to check the weather generates data. However, as the famous IT expression goes: Garbage In, Garbage Out. The information a company can derive from data is only as good as the data itself. Poor-quality data can make information harder to extract, and ultimately lead to poor decision-making within the company.
What’s more, poor-quality data can have a major impact on a company’s organization. For example, a high percentage of incorrect email addresses in a database can distort the results of a marketing campaign, while an incorrect measurement system can lead to poor sales predictions.
That’s why it’s important for everyone involved in creating, manipulating or exploiting data to ensure its quality. The following is a non-exhaustive list of errors that can lead to data quality problems:
1 - Poor understanding of the data environment
Not knowing the nature of the data in our possession, or the definitions of the variables present in the dataset, can lead to a flawed analysis or, worse, to sharing an inaccurate or approximate interpretation of it.
The first thing to do before exploring a dataset is to look at the metadata, i.e. the information we have about the data:
What is the source of the data? How was the data collected?
What kind of files do we have? How large are they?
What characteristics are present?
A dataset shared by the government and containing several gigabytes of data collected over many years is not the same thing as a dataset obtained after surveying a sample of the population.
Knowing your data is one way of avoiding many mistakes.
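A few lines of pandas are usually enough to get this first overview. The sketch below assumes the data comes as a CSV file, with "data.csv" standing in for whatever file you actually receive:

```python
import pandas as pd

# First look at a dataset before any analysis
# ("data.csv" is only a placeholder for whatever file you receive).
df = pd.read_csv("data.csv")

print(df.shape)                     # how many rows and columns?
print(df.dtypes)                    # what type is each variable?
print(df.head())                    # what do a few raw observations look like?
print(df.describe(include="all"))   # basic statistics for every column
print(df.memory_usage(deep=True).sum() / 1e6, "MB in memory")
```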
2 - Incomplete information
Missing values are a recurring topic in data science.
In statistics, missing data or missing values occur when no value is represented for a variable for a given observation.
Here are some of the reasons for missing data in a dataset:
- The user has forgotten to fill in a field.
- Data has been lost during a manual transfer from an old database.
- There has been a programming error.
- The user chose not to fill in a field because of their beliefs about how the results would be used or interpreted.
Sometimes these are simply random errors; other times it’s a systematic problem.
Missing values are common and can have a significant effect on analysis, prediction performance, or any use of data containing them.
Good management of missing data is therefore fundamental to the successful completion of a study.
To avoid problems, you first need to know which values should be considered as missing. For example, some variables contain spaces or special characters ('?', '\', etc.) that correspond to missing values but are not necessarily recognized as such.
Subsequently, the replacement of missing values, or the deletion of the rows/columns concerned, must be done intelligently.
There’s no point in deleting a column containing 5 missing values out of 200,000, but a row with 60% missing data could do more harm than good to a machine learning model.
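As a minimal sketch of these steps with pandas, assuming the data lives in a CSV file and that '?' and '\' are the placeholders to watch for, you could proceed as follows:

```python
import pandas as pd

# Treat common placeholder characters as missing values at load time
# (the file name and the list of placeholders are assumptions to adapt).
df = pd.read_csv("data.csv", na_values=["?", "\\", "", " "])

# Share of missing values per column
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Drop columns that are mostly empty, then impute the remaining numeric gaps
df = df.drop(columns=missing_ratio[missing_ratio > 0.6].index)
df = df.fillna(df.median(numeric_only=True))
```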
3 - Typographical errors and inaccurate data
Inaccurate data is any data with a conformity or veracity problem: a misspelled name, an incomplete address, a value that has nothing to do with the variable that contains it. Most of these errors can be corrected, but only if they can be recognized in the first place.
Today, many companies struggle with inaccurate data, and even more with their ability to identify it.
- How do you know if the results of a query are wrong?
- Especially if the answer seems correct?
If, during an internal review, an analyst looks up their company’s monthly sales over the last two years and comes up with a result of €100, they are bound to doubt the veracity of the information. And rightly so: the value is almost certainly inaccurate.
But if one of the sales figures is reported as €200,000 instead of €236,000, the analyst is unlikely to question it.
Inaccurate data feeds into new data and analyses of equally poor quality, which can in turn lead to poor decision-making.
That’s why it’s so important to make sure, as soon as you collect or create data, that it’s accurate and doesn’t contain any errors that could cause trouble later on.
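A few simple sanity checks, run as soon as the data is collected, can catch a surprising number of these errors. The sketch below uses hypothetical column names and thresholds that would need to match your own business rules:

```python
import pandas as pd

# Basic sanity checks on freshly collected data
# (column names, values and thresholds are hypothetical).
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@@example", "carol@example.org"],
    "monthly_sales": [236_000, 200_000, -50],
})

# Flag e-mail addresses that do not match a simple pattern
valid_email = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print(df.loc[~valid_email, "email"])

# Flag values that are impossible or implausible for the business
print(df[(df["monthly_sales"] < 0) | (df["monthly_sales"] > 1_000_000)])
```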
4 - Inconsistent format / Type problems
For example, if an organization maintains a database of its consumers, the storage format of the basic information must be predetermined. The name (first name then last name, or the reverse), date of birth (American or European style) and telephone number (with or without country code) must all be recorded in exactly the same format. Otherwise, it can be very time-consuming later on for those who have to handle this data simply to disentangle the many formats present.
Similarly, the type of each variable must be predefined. A variable representing a date, some of whose values are stored as datetime while others are stored as text, will inevitably cause problems for whoever uses it.
Take care to always define the format and type of the variables you create, or make sure they are uniform and consistent when you retrieve data.
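In pandas, for example, a date column stored as mixed strings can be brought back to a single datetime type. This is only a sketch with made-up values, and format="mixed" requires pandas 2.0 or later:

```python
import pandas as pd

# Harmonizing a date stored in mixed formats into a single datetime type
# (the values are made up; format="mixed" requires pandas >= 2.0).
df = pd.DataFrame({"birth_date": ["1990-04-23", "04/23/1990", "23 April 1990"]})

df["birth_date"] = pd.to_datetime(df["birth_date"], format="mixed", errors="coerce")

print(df.dtypes)   # the column is now a proper datetime64 type
print(df)
```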
5 - Duplicates
Duplicates can result from collecting identical information from different sources, from human error, or from data being added instead of updated. Duplicates can distort any kind of data analysis and even indirectly lead to bad decision-making.
Moreover, the same data duplicated across several systems will follow different life cycles. The copies will eventually evolve and no longer hold the same values, even though they are identified as the same record. This creates a real risk when choosing which version of the duplicated data to use for a critical business decision. Redundant data can also be very costly to the business if there is a lot of it.
Fortunately, in Python for example, there are functions (such as drop_duplicates from Pandas) that make it very easy to get rid of duplicates.
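Here is a minimal sketch of that function on a made-up customer table:

```python
import pandas as pd

# Removing duplicates with drop_duplicates (the customer table is made up).
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

print(df.duplicated().sum())   # number of fully identical rows
df = df.drop_duplicates()      # keep the first occurrence of each exact duplicate

# Duplicates can also be defined on a subset of columns
df = df.drop_duplicates(subset=["email"], keep="first")
print(df)
```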
6 - Units of measurement or multiple languages
Another recurring problem is the use of different languages, different types of code and different units of measurement.
Indeed, before assembling data from different sources, you need to make sure that they are compatible, or consider converting them.
There are many examples of disastrous mistakes made because these issues were not taken into account at the right time, such as NASA’s multi-million dollar Mars Climate Orbiter, which was lost because part of its navigation software worked in imperial units of measurement while the rest of the system expected the metric system.
Similarly, processing data stored in several languages can also create difficulties if the analysis tools don’t recognize them or don’t know how to translate them. Even special characters such as umlauts and accents can wreak havoc if a system is not configured for them. So you need to take these potential problems into account when dealing with international data, and program your algorithms accordingly.
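As a small illustration, here is a sketch of converting distances to a common unit before assembling two sources; the column names and values are made up:

```python
import pandas as pd

# Converting units before assembling data from two sources
# (column names, values and the conversion context are hypothetical).
us_data = pd.DataFrame({"distance_miles": [1.0, 2.5]})
eu_data = pd.DataFrame({"distance_km": [3.2, 7.1]})

# Bring everything to the same unit (kilometres) before concatenating
us_data["distance_km"] = us_data["distance_miles"] * 1.60934
combined = pd.concat(
    [us_data[["distance_km"]], eu_data[["distance_km"]]],
    ignore_index=True,
)
print(combined)
```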
7 - Outliers
In statistics, an outlier is a value that differs greatly from the distribution of a variable. It is an abnormal observation, which deviates from otherwise well-structured data.
Detecting outliers or anomalies is one of the fundamental problems of data mining. The ongoing expansion of data is making us rethink the way we approach anomalies, and the use cases that can be built by examining them.
The use cases and solutions developed through anomaly detection are limitless.
For example, we now have smart watches and wristbands that can measure our heart rate every few minutes.
Detecting anomalies in heart rate data can help predict heart disease.
In Data Science, outliers can affect certain statistical parameters, such as the mean. If outliers go undetected, this can distort our understanding of a dataset and lead us to make erroneous assumptions about it.
Another reason why it’s important to pay attention to outliers is that a majority of Machine Learning algorithms are very sensitive to the data they are trained on, as well as to their distributions.
Having outliers in the training set of a Machine Learning model can make the training phase longer and potentially biased.
As a result, the prediction model produced will perform less well or be less accurate.
It’s easy to identify an outlier when the observations form a one-dimensional set of numerical values. For example, you can clearly identify the outlier in the following list: [7, 2, 38600, 8, 4].
But when you have thousands of observations or multi-dimensions, outlier detection requires the use of certain statistical tools (such as standard deviation), graphical tools (such as Boxplot) or even algorithms such as clustering with DBSCAN.
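As a sketch, the boxplot (interquartile-range) rule is enough to flag the outlier in the small list above; the 1.5 × IQR threshold used here is the usual boxplot convention, not a universal rule:

```python
import numpy as np

# Flagging the outlier in the example list with the interquartile-range (boxplot) rule.
values = np.array([7, 2, 38600, 8, 4])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # [38600]
```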
8 - Processing errors
In Data Science, before data modeling, it is common to resort to certain mathematical transformations, such as normalizing the values of a variable, or switching from a categorical variable to a continuous or indicator variable.
These transformations are often linked to assumptions made about the data, or arise from the constraints of the algorithm you wish to use.
In all cases, it’s important to check that the calculations you make are correct and consistent. Sometimes, the results obtained may not correspond to your needs, leading to errors and misinterpretations later on.
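One way to build in such checks is to verify, right after each transformation, that the output has the properties you expect. The sketch below uses made-up columns and assumes scikit-learn is available:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Applying two common transformations and checking that the results are consistent
# (the columns and values are made up; scikit-learn is assumed to be installed).
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 80_000],
    "country": ["FR", "DE", "FR", "ES"],
})

# Standardize income, then verify that the result has mean ~ 0 and std ~ 1
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()
print(round(df["income_scaled"].mean(), 6), round(df["income_scaled"].std(ddof=0), 6))

# One-hot encode the categorical variable, then verify each row has exactly one 1
dummies = pd.get_dummies(df["country"], prefix="country")
assert (dummies.sum(axis=1) == 1).all()
print(dummies)
```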
9 - Definition problems
It is important to always be able to describe the variables contained in a dataset accurately. If the definition of a variable is not precise enough, it may be necessary to seek further information.
Sometimes, a variable can have several meanings or calculation methods for different organizations, countries or continents.
If you’re doing an analysis on the unemployment rate, for example, it’s important to remember that unemployment rates are calculated differently from one country to another. And even in France, the unemployment rate for INSEE and Pôle Emploi is not the same.
So be careful not to compare these two indicators, or to merge two tables without taking these differences into account.
10 - Compliance issues
Finally, it may seem obvious to some, but when handling data, it’s important to ensure that the company, its managers and its employees comply with the legal and ethical standards applicable to them. In this way, you can help your company avoid the financial, legal and reputational risks incurred by organizations that fail to comply with laws, regulations, conventions, or simply a certain ethic or professional code of conduct.
Beyond the classic steps of cleansing and transforming data for analysis or modeling, following the advice given above will save you precious time and avoid many mistakes that can be costly to repair.
As mentioned above, the worst mistakes are those that are ignored right up to the decision-making stage, and which can prove critical for a company.
To optimize the quality of the data you’re working with, we’ve developed a curriculum that will enable you to put these tips into practice with Python.