Missing Data: How to Manage It Effectively in Data Science

Q: Understanding the Nature of Missing Data

A missing value refers to the absence of a value in a dataset and can be identified using tools like missingno or descriptive functions such as .isnull().sum().

Q: Why is Data Missing? Mechanisms of Data Loss?

There are three mechanisms: MCAR (random), MAR (related to other variables), and MNAR (dependent on the missing value itself).

Q: Strategies for Handling Missing Data

Common strategies include deletion (listwise, pairwise, variable) and imputation (simple or advanced like regression and k-NN).

Q: Choosing the Right Strategy and Evaluating its Impact

Choose a method based on variable type, missing rate, and variable importance, then evaluate impact through distribution comparisons and cross-validation.

15 Oct 2025

Data Science

Daniel

In the real world, perfectly complete datasets are rare. Whether it’s due to manual entry, automated extraction, or merging multiple sources, missing data is ubiquitous. If not properly managed, it can skew analyses, reduce model performance, and introduce significant biases.

Understanding the nature and mechanisms behind these gaps is therefore essential. While it may be tempting to overlook missing data, doing so often means ignoring a significant part of the problem.

In this article, we will explore in detail how to identify, categorize, and handle missing data in data science. We will also cover the criteria for choosing a treatment method and best practices for limiting the impact of missing values on your results.

Understanding the Nature of Missing Data

Definition and Identification of Missing Data

A missing value refers to the absence of a value in a cell of a dataset. It can be represented as NaN, None, an empty cell, or indicators like “N/A”.

Several tools exist to identify them:

Visual exploration: libraries like missingno (Python) allow you to visualize patterns of missing values (e.g., heatmaps, matrices).
Descriptive functions: in Python, .isnull().sum() on a Pandas DataFrame gives the number of missing values per column.
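
As a minimal sketch of the descriptive approach, the snippet below counts missing values per column with pandas. The DataFrame, its column names, and its values are hypothetical and only serve the illustration. It also converts the textual placeholder "N/A" into a genuine missing value first, since .isnull() would otherwise treat it as an ordinary string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "N/A" is a textual placeholder, not a real NaN.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "city": ["Paris", "N/A", None, "Lyon"],
    "income": [3200.0, 2800.0, np.nan, np.nan],
})

# Turn placeholder strings into genuine missing values before counting.
df = df.replace("N/A", np.nan)

# Number and share of missing values per column.
missing_count = df.isnull().sum()
missing_rate = df.isnull().mean()

print(missing_count)
print(missing_rate.round(2))
```

Looking at the rate alongside the raw count makes columns of different lengths or datasets of different sizes easier to compare.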

Why is Data Missing? Mechanisms of Data Loss?

Understanding why a value is missing is fundamental. Three mechanisms are classically distinguished:

MCAR (Missing Completely at Random): The probability that a value is missing is unrelated to any variable, observed or unobserved, including the missing value itself.
Example: a random failure during data collection.
MAR (Missing at Random): The absence depends on other observed variables, but not on the missing value itself.
Example: men respond less often than women to a question about depression, so the absence depends on gender, which is observed.
MNAR (Missing Not at Random): The absence depends on the missing value itself or on an unobserved factor.
Example: very high incomes are rarely reported, so the value itself influences whether it is missing.
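
The three mechanisms can be made concrete with a small simulation. This is only a sketch under stated assumptions: the variables (is_male, income), the distributions, and the missingness probabilities are all invented for the illustration, and income is generated independently of the group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical variables: a binary group indicator and an income value.
is_male = rng.integers(0, 2, size=n).astype(bool)
income = rng.lognormal(mean=10, sigma=0.5, size=n)

# MCAR: every value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends on an observed variable (the group), not on income.
mar_mask = rng.random(n) < np.where(is_male, 0.35, 0.05)

# MNAR: missingness depends on the value itself (high incomes go unreported).
high_income = income > np.quantile(income, 0.8)
mnar_mask = rng.random(n) < np.where(high_income, 0.60, 0.10)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    # Ratio of the mean of the remaining observed values to the true mean.
    print(name, round(income[~mask].mean() / income.mean(), 3))
```

In this setup, the observed mean stays close to the true mean under MCAR, while under MNAR it is pulled downward because large values are removed more often. Under MAR the distortion depends on how strongly the driving variable relates to the variable with gaps, which is zero here by construction.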

Impact of Different Types of Missingness

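The type of mechanism strongly influences the treatment strategy. Under MCAR, simple treatments such as deletion reduce the sample size but do not bias estimates. MAR generally calls for methods that exploit the observed variables, and MNAR requires more complex, often domain-specific methods.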

Strategies for Handling Missing Data

1. Deletion of Missing Data

Listwise Deletion

This method involves deleting all rows of a dataset containing at least one missing value. It is commonly used because it is straightforward to implement.
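
A minimal sketch of listwise deletion with pandas, using a hypothetical DataFrame (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 41, 33, 52],
    "income": [3200.0, 2800.0, np.nan, 4100.0, 3900.0],
    "score": [0.7, 0.4, 0.9, np.nan, 0.6],
})

# Listwise deletion: keep only rows with no missing value in any column.
complete_cases = df.dropna(axis=0, how="any")

rows_lost = len(df) - len(complete_cases)
print(f"{rows_lost} of {len(df)} rows dropped")
```

Here only 3 of the 15 cells are missing, yet 3 of the 5 rows are discarded, which shows how quickly this method can shrink a dataset when gaps are spread across columns.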

Pairwise Deletion

This approach involves using all available data for each specific analysis, without necessarily excluding an entire row. For instance, a correlation between two variables will only use observations for which these two variables are present.
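
As an illustration, pandas computes correlations on a pairwise basis by default: DataFrame.corr() excludes missing values separately for each pair of columns. The sketch below uses hypothetical data and also counts how many observations each pair actually shares.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 41, 33, 52],
    "income": [3200.0, 2800.0, np.nan, 4100.0, 3900.0],
    "score": [0.7, 0.4, 0.9, np.nan, 0.6],
})

# DataFrame.corr() excludes missing values pair by pair.
pairwise_corr = df.corr()

# Count how many observations each pair of columns actually shares.
observed = df.notna().astype(int)
pair_counts = observed.T @ observed

print(pairwise_corr.round(2))
print(pair_counts)
```

Reporting the pair counts next to the correlations helps when interpreting the matrix, because each coefficient may rest on a different subset of rows.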

Variable Deletion

This method involves deleting an entire column if the percentage of missing values is too high (often >50%). It may be relevant when the concerned variable is difficult to recover or not very useful.
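
The sketch below drops columns whose share of missing values exceeds a threshold. The 50% cutoff follows the rule of thumb mentioned above; the data and the name max_missing_rate are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [3200.0, 2800.0, 3000.0, 4100.0],
    "fax_number": [np.nan, np.nan, np.nan, "0123"],
})

max_missing_rate = 0.5  # illustrative threshold, to be adapted to the context

# Share of missing values per column.
missing_rate = df.isna().mean()

# Keep only the columns whose missing rate does not exceed the threshold.
kept = df.loc[:, missing_rate <= max_missing_rate]
dropped = missing_rate[missing_rate > max_missing_rate].index.tolist()

print("dropped:", dropped)
```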

Method	Advantages	Disadvantages
Listwise deletion	Easy to implement; no artificial data added	Significant data loss; risk of bias if the data are not MCAR
Pairwise deletion	Retains more data; less destructive	Results based on different subsamples are hard to interpret; statistical matrices can be unstable
Variable deletion	Fast cleaning; dimensionality reduction	Risk of removing a relevant variable

2. Simple Imputation

Imputation by Mean, Median, or Mode

This approach replaces missing values with measures of central tendency. The mean and median are used for numerical variables, while the mode applies to both categorical and numerical variables.
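
A minimal sketch of these three imputations with pandas, on hypothetical data. The age column contains an outlier (90) to show how the mean and the median can differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0, 90.0],
    "city": ["Paris", "Lyon", "Paris", None, "Paris"],
})

# Numerical column: the median is less sensitive to the outlier than the mean.
age_mean_imputed = df["age"].fillna(df["age"].mean())
age_median_imputed = df["age"].fillna(df["age"].median())

# Categorical column: use the most frequent category.
city_mode_imputed = df["city"].fillna(df["city"].mode().iloc[0])

print(age_mean_imputed.tolist())
print(age_median_imputed.tolist())
print(city_mode_imputed.tolist())
```

With these values the mean fills the gap with 45 and the median with 35, so the choice matters as soon as the distribution is skewed.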

Imputation by a Constant Value or Binary Indicator

An arbitrary value (like -1 or “Unknown”) is used to replace missing data. Sometimes, a new binary variable is added to indicate if the original value is missing.
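
The following sketch combines both ideas on hypothetical data: it first records where values were missing in a new indicator column (the name income_was_missing is an illustrative choice), then fills the gaps with sentinel values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [3200.0, np.nan, 4100.0, np.nan],
    "job": ["nurse", None, "teacher", "driver"],
})

# Record where the value was missing before overwriting it.
df["income_was_missing"] = df["income"].isna().astype(int)

# Replace missing values with explicit sentinel values.
df["income"] = df["income"].fillna(-1)
df["job"] = df["job"].fillna("Unknown")

print(df)
```

The indicator has to be created before the fill, otherwise the information about which rows were missing is lost.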

Method	Advantages	Disadvantages
Mean / Median / Mode	Easy and quick; low resource consumption	Reduces variance; can distort distribution and correlations
Constant value / Indicator	Preserves missingness information; compatible with certain models	May introduce bias; sensitive to the arbitrarily chosen value

3. Advanced Imputation

Imputation by Regression

This involves predicting the missing value using a regression model that employs other variables in the dataset as predictors.
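
As a minimal sketch, the example below imputes a hypothetical income column from age with a simple linear fit (numpy.polyfit). The data are constructed so that observed incomes lie exactly on a line, which keeps the result easy to check; real data would of course be noisier, and a single predictor is an assumption made for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical data: where observed, income equals 1000 + 50 * age.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0, 60.0],
    "income": [2000.0, 2500.0, np.nan, 3500.0, np.nan],
})

observed = df["income"].notna()

# Fit income = slope * age + intercept on the rows where income is observed.
slope, intercept = np.polyfit(
    df.loc[observed, "age"].to_numpy(),
    df.loc[observed, "income"].to_numpy(),
    deg=1,
)

# Predict income for the rows where it is missing.
df.loc[~observed, "income"] = slope * df.loc[~observed, "age"] + intercept

print(df)
```

Because every imputed value falls exactly on the fitted line, this deterministic variant tends to make the relationship between the variables look stronger than it is.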

Imputation by k-Nearest Neighbors (k-NN)

Missing values are imputed by taking the average of the k most similar observations, measured using a distance between observed variables.
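
The idea can be sketched in a few lines of NumPy. The function knn_impute_column below is a hypothetical helper written for this illustration, and it makes a simplifying assumption: only one column contains gaps, and all other columns are fully observed. For real projects, scikit-learn's KNNImputer offers a maintained implementation that also handles gaps in several columns.

```python
import numpy as np

def knn_impute_column(X, target_col, k=2):
    """Fill NaNs in one column with the mean of the k nearest rows.

    Assumes every other column is fully observed.
    """
    X = np.asarray(X, dtype=float).copy()
    features = np.delete(X, target_col, axis=1)
    missing = np.isnan(X[:, target_col])
    donors = np.flatnonzero(~missing)

    for i in np.flatnonzero(missing):
        # Euclidean distance from row i to every donor row.
        dist = np.linalg.norm(features[donors] - features[i], axis=1)
        nearest = donors[np.argsort(dist)[:k]]
        X[i, target_col] = X[nearest, target_col].mean()
    return X

X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])
imputed = knn_impute_column(X, target_col=1, k=2)
print(imputed)
```

The two closest rows to the incomplete one are the first two, so the gap is filled with the average of 10 and 20. In practice, features usually need to be scaled first so that no single variable dominates the distance.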

Method	Advantages	Disadvantages
Regression	Leverages inter-variable relationships	Risk of bias if assumptions are violated; may overestimate relationships between variables
k-Nearest Neighbors (k-NN)	Captures complex relationships; suitable for numeric and mixed data	High computational cost; sensitive to the choice of k and distance metric

Choosing the Right Strategy and Evaluating its Impact

Key Factors for Selecting a Method

The choice of a method for handling missing data depends on several factors. First, the nature of the variables (numerical, categorical, or mixed) guides the choice of techniques: some methods like mean or regression imputation are applied mainly to numerical variables, while mode or constant values are suitable for categorical variables.

The rate of missing values also matters. As a rough rule of thumb, simple approaches may suffice below 5%, whereas beyond 20% it becomes risky to delete data or use naive imputations.

Moreover, it’s important to determine whether a variable truly affects the target or the analyses at hand. If a variable has many missing values and does not provide useful information, it is often preferable to delete it, which avoids costly and unnecessary processing while simplifying the model or visualizations. This approach is particularly relevant when exploratory analyses or correlation tests show that the variable is only weakly related to the others.

Once the method is applied, it is essential to evaluate its impact: compare variable distributions before/after imputation, measure model performance via cross-validation, and conduct a sensitivity analysis by testing multiple strategies to ensure the robustness of the results.
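
As a hedged illustration of the distribution comparison, the sketch below removes values completely at random from a simulated variable, applies two simple strategies, and compares the spread before and after. The distribution, the 30% missing rate, and the helper impute are assumptions made for the example. Cross-validating a downstream model is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical complete variable, then about 30% of values removed at random.
true_values = rng.normal(loc=50.0, scale=10.0, size=5_000)
mask = rng.random(true_values.size) < 0.30
observed = true_values[~mask]

def impute(values, mask, fill):
    """Return a copy of values with the masked entries replaced by fill."""
    out = values.copy()
    out[mask] = fill
    return out

strategies = {
    "mean": impute(true_values, mask, observed.mean()),
    "median": impute(true_values, mask, np.median(observed)),
}

# Compare the spread before and after imputation for each strategy.
print("observed std:", round(observed.std(), 2))
for name, imputed in strategies.items():
    print(name, "imputed std:", round(imputed.std(), 2))
```

Both strategies preserve the center of the distribution but shrink its standard deviation, which is the variance reduction listed among the drawbacks of simple imputation. Running the same comparison across several candidate methods is one simple form of sensitivity analysis.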

Conclusion

Handling missing data is an unavoidable challenge in data science. Understanding why values are missing, identifying the mechanism involved, choosing an appropriate treatment, and evaluating its impact are critical steps to ensure the reliability of analyses.

Rather than seeking a single solution, it is often preferable to test multiple approaches tailored to the specific context. With the evolution of tools and techniques, the management of missing data is becoming increasingly sophisticated and integrated into the processing pipeline.

Adopting a rigorous, transparent, and informed approach remains key to addressing this central issue.
