One of the most painful parts of the Exploration and Preparation stage of a Data Science project is missing values. How do I handle missing values – ignore them or treat them? In this article, we will discuss a range of methods to impute missing data.
When it comes to handling missing data, there is no right or wrong! The right approach depends on the share of missing values in the dataset, which variables have missing values, whether those variables are dependent or independent variables, and so on.
So, before we jump into the common imputation techniques, it is important to understand what missing data is!
So how exactly are missing values represented?
Python libraries represent missing values as “NaN”, which is short for “Not a Number”. There are several reasons for missing values in a dataset. For example:
- A house without a garage wouldn’t include an answer for “How large is the garage?”
- Someone being surveyed may choose not to share their phone number
- Errors during the data extraction or collection stage
You can broadly categorize the reasons for missing values as follows:
MCAR (Missing Completely at Random):
This is the highest level of randomness. Here, the probability of a value being missing does not depend on either the observed or the missing values. This effectively tells us that the cause of the missing data is unrelated to the dataset. An example of MCAR is a weighing scale that ran out of battery and gave no readings.
MAR (Missing at Random):
These missing values are also random, but there is some association: the missingness may depend on the observed values (X variables) but not on the target variable (Y variable) or on the missing data itself. Continuing the previous example, a weighing scale placed on a soft surface will produce more missing values than one placed on a hard surface.
NMAR (Not Missing at Random):
These missing values follow a pattern: the missingness depends on other variables and on the missing value itself. This is a serious issue, and it is wise to check the data gathering process and try to understand why the data is missing. For example, the mechanism of the weighing scale may wear out over time, producing more missing data as time passes, but we may fail to notice this.
Theoretically, you can classify missing values into these categories based on domain knowledge and analysis of the sample data, and handle them accordingly. Now that we have seen what missing data is, we will look at some of the techniques used to handle it!
Data Imputation – an Overview
Okay! We have analyzed the data and found that many features have missing values! How do we fix this? – Well, there is no one-size-fits-all strategy for data imputation. The strategy you should follow depends on the domain and the type of feature you are handling. In the real world, you will often follow a blend of the following strategies. So let’s get started!
For illustration, we will explain the impact of various data imputation techniques using scikit-learn‘s iris dataset. You can load the dataset using the following code:
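The original loading snippet is not shown, so here is a minimal sketch. The column names and the positions of the planted NaNs are assumptions made for illustration – the stock iris data ships complete, so we inject 8 missing values across 4 records to match the counts discussed in this article:

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the iris data as a pandas DataFrame and rename the columns
# to the style used in this article (an assumed naming convention)
df = load_iris(as_frame=True).frame
df = df.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
    "target": "Species",
})

# The stock iris data has no gaps, so we plant 8 NaNs across
# 4 records (positions chosen arbitrarily for illustration)
holes = [(3, "SepalLengthCm"), (3, "SepalWidthCm"),
         (60, "PetalLengthCm"), (60, "PetalWidthCm"),
         (100, "SepalLengthCm"), (100, "PetalLengthCm"),
         (140, "SepalWidthCm"), (140, "PetalWidthCm")]
for row, col in holes:
    df.loc[row, col] = np.nan
```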
We can quickly check whether the data has any missing values using the following command:
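A sketch of that check with pandas’ `isnull()`. To keep the snippet self-contained, a fresh copy of the data is loaded and two NaNs are planted so the check has something to find:

```python
import numpy as np
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
# Plant two NaNs for demonstration
df.iloc[3, 0] = np.nan
df.iloc[60, 2] = np.nan

# Per-column count of missing values
print(df.isnull().sum())
# Grand total across the whole DataFrame
total = int(df.isnull().sum().sum())
print(total)
```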
There are 8 missing values across the features in the dataset. We will now go over some of the basic techniques for data imputation.
Data Imputation Techniques
Remove records with missing data:
This is the easiest way to handle missing data – we simply remove them! But it is not a recommended method; we choose it only when the number of missing values is small. This method simply removes every record that has at least one missing value in any feature. For example, this dataset has 4 records with missing values.
So, we can remove those records using pandas’ dropna() function:
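A minimal sketch of the deletion step, again with NaNs planted in 4 records so it runs on its own:

```python
import numpy as np
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame
# Plant missing values in 4 records for demonstration
for row in (3, 60, 100, 140):
    df.iloc[row, 0] = np.nan

# dropna() discards every record with at least one missing value
df_clean = df.dropna()
print(df.shape, df_clean.shape)  # 150 rows before, 146 after
```

Note that `dropna()` returns a new DataFrame; assign the result (or pass `inplace=True`) to keep it.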
This has completely removed those 4 records. As a rule of thumb, we can apply this method if the percentage of missing values is less than 1-2%. Better yet, avoid it unless you cannot find a meaningful way to impute the data.
- This is by far the easiest method.
- This is suitable for MCAR.
- It leads to a loss of sample size.
- It also produces biased estimates and parameters.
Impute with Mean/Median/Mode:
The most commonly used imputation technique in Machine Learning is replacing the missing values with the mean, median, or mode of the non-missing values in a column. It is far from foolproof, but it is very easy to implement and generally requires little computation. Mean and median imputation are for numerical features, while the mode is used for categorical features. The main disadvantage of this method is that it reduces the variance of the dataset.
Generally, this technique can be used in 2 ways, and the latter of the two is recommended:
i) Generalized Imputation:
Here we take the average of the entire feature and impute that value for the missing values.
For example, here the mean of the entire ‘Sepal length’ column is calculated and imputed.
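A sketch of generalized mean imputation on a toy sepal-length column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy sepal-length column with two gaps
sepal_length = pd.Series([5.1, 4.9, np.nan, 4.6, np.nan, 5.4],
                         name="SepalLengthCm")

# Every missing value gets the mean of the whole column
filled = sepal_length.fillna(sepal_length.mean())
print(filled.tolist())
```

Swapping `sepal_length.mean()` for `.median()` or `.mode()[0]` gives median and mode imputation in the same way.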
ii) Simple Case Imputation:
Here the mean is calculated within specific groups. For example, the records are grouped by species, the mean is calculated per group, and that group’s mean is imputed for the missing values in the respective group.
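A sketch of this group-wise variant using `groupby` and `transform`, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Species": ["setosa", "setosa", "setosa",
                "virginica", "virginica", "virginica"],
    "SepalLengthCm": [5.0, np.nan, 5.2, 6.4, 6.6, np.nan],
})

# Each gap is filled with the mean of its own species group:
# the setosa gap gets mean(5.0, 5.2), the virginica gap mean(6.4, 6.6)
df["SepalLengthCm"] = (
    df.groupby("Species")["SepalLengthCm"]
      .transform(lambda g: g.fillna(g.mean()))
)
print(df)
```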
The advantage of mean imputation is that, for a continuous variable unrelated to any other independent variable, mean/median imputation does not lead to a loss in efficiency.
However, some of the major disadvantages are a reduction in variance, the introduction of bias in multivariate estimates, distortion of the data’s distribution, and the introduction of invalid data into the sample. Experiments have even shown mean imputation performing worse than deletion.
Median imputation is less vulnerable to outliers: a dataset with outliers can skew the mean estimate, causing more harm than good, whereas the median is much less affected by them.
Though this method has some advantages, it is not a recommended approach in most real world cases. With this we move on to some of the advanced techniques for data imputation!
K-Nearest Neighbor Imputation:
It can be done in two ways: one using the K-Nearest Neighbours algorithm and the other using the K-Means clustering algorithm. They are quite similar in functionality. In K-Nearest Neighbours, a missing value is filled by finding the K observations that are most similar to the instance in question.
For example, our dataset has five variables – SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, and Species – with missing values in “Species”. With KNN, we consider the SepalLengthCm, SepalWidthCm, PetalLengthCm, and PetalWidthCm of the observation with the missing value, and impute the most common species among its nearest neighbours. The presumption is that flowers with similar measurements belong to the same species.
Thus, the assumption behind using KNN for missing values is that a missing value may be approximated by the values of the points that are closest to that missing value, based on other variables . Similarly, K-means Clustering Algorithm considers similar clusters for imputation.
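The article’s Species example treats imputation as a classification problem; scikit-learn’s `KNNImputer` implements the same neighbour-based idea for numeric features. A minimal sketch on a small made-up matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric matrix with missing entries (values are illustrative)
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, np.nan, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, np.nan, 1.9],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest neighbours, with distances computed on the present features
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```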
The main disadvantages of this method are that it takes a lot of time if the dataset has many variables, and that the selection of the k-value is quite critical. Generally, KNN imputation provides better results than K-Means clustering imputation.
In addition to this, there are also many regression-based imputation methods.
Predictive Models (Regression Methods):
This approach uses Regression-based models to find the missing value. In this method, the variable with missing value is considered the target variable, and the remaining variables become the features for prediction.
Many advanced imputation libraries are designed around this principle, and algorithms like Linear Regression and Logistic Regression are used to predict the missing values.
One of the benefits of using predictive models to estimate missing values is that the features often have some underlying relationship to each other, which the models can exploit to estimate the missing values while maintaining those relationships in the final dataset. However, the key drawback of this method is that if there is little or no relationship between the target variable and the predictor variables, the missing values will not be predicted correctly. Thus, we have to assume that the attributes have relationships (correlations) among themselves. This method works well if the data is MAR.
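One widely used regression-based implementation is scikit-learn’s `IterativeImputer`, which models each feature with missing values as a regression on the other features. It is offered here as one example of this family, not as the method the article had in mind; note the extra `enable_iterative_imputer` import, which scikit-learn requires because the estimator is still experimental:

```python
import numpy as np
# Required first: IterativeImputer is still experimental in scikit-learn
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Small numeric matrix with missing entries (values are illustrative)
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, np.nan],
    [6.3, np.nan, 6.0],
    [5.8, 2.7, 5.1],
])

# Each incomplete feature is regressed on the others; the model's
# predictions fill the gaps, iterating round-robin until convergence
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```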
Missing value imputation is one of the key steps in a machine learning pipeline and it could make or break your model. In this article, we have discussed some of the basic imputation techniques and there are many advanced imputation techniques available. The imputation strategy should be decided based on the type of data missing, domain expertise, etc.
You can download the code from below:
I hope you found this article helpful. Happy Learning!