One of the most common problems I have faced in data cleaning and exploratory analysis is handling missing values. First, understand that there is NO single right way to deal with missing data.
How to deal with missing data: the methodology
Before moving on to data imputation methods, we need to understand why the data is missing.
Missing at Random (MAR): the propensity of a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.
Missing Completely at Random (MCAR): the fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables.
Missing Not at Random (MNAR): two possible reasons are that the missing value depends on the hypothetical value itself (for example, people with high salaries generally do not want to reveal their income in surveys) or that the missing value depends on the value of another variable (for example, suppose women generally do not want to reveal their age; here, the missingness in the age variable is driven by the sex variable).
In the first two cases, it is safe to delete the observations with missing values, while in the third case deleting observations with missing values can introduce bias into the model. We must therefore be very careful before deleting observations. Note that imputation does not necessarily give better results.
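Before choosing a strategy, it helps to look at the missingness pattern itself. As a rough sketch (assuming a pandas DataFrame called mydata with hypothetical columns salary and sex), you can check how much is missing per column and whether missingness in one column varies with another observed variable:
# In Python
# Share of missing values per column
print(mydata.isna().mean())
# Does missingness in "salary" vary with the observed "sex" variable?
# A clear difference between groups suggests MAR/MNAR rather than MCAR.
print(mydata.assign(salary_missing=mydata["salary"].isna())
            .groupby("sex")["salary_missing"].mean())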
Deletion
- Listwise
Listwise deletion (complete-case analysis) removes an entire observation when it has one or more missing values. In particular, if the missing data is limited to a small number of observations, you can simply choose to exclude those cases from the analysis. In most situations, however, listwise deletion is disadvantageous: the MCAR (Missing Completely at Random) assumption it relies on is rarely defensible, so listwise deletion tends to produce biased parameters and estimates.
# In R
newdata <- na.omit(mydata)
# In Python
mydata.dropna(inplace=True)
- Pairwise
Pairwise deletion analyzes all cases in which the variables of interest are present, and thus maximizes the data available on an analysis-by-analysis basis. One strength of this technique is that it increases the power of your analysis, but it also has drawbacks: it still assumes that the missing data is MCAR, and with pairwise deletion you end up with a different number of observations contributing to different parts of your model, which can make interpretation difficult.
# Pairwise deletion
ncovMatrix <- cov(mydata, use = "pairwise.complete.obs")
# Listwise deletion
ncovMatrix <- cov(mydata, use = "complete.obs")
- Dropping Variables
It is always better to keep data than to throw it away. Sometimes you can drop a variable if its values are missing for more than 60% of cases, but only if that variable is insignificant. That said, imputation is always a better choice than dropping variables.
# In R
df <- subset(mydata, select = -c(x, z))
df <- mydata[-c(1, 3:4)]
# In Python
del mydata['column_name']
mydata.drop('column_name', axis=1, inplace=True)
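If you want to make the 60% rule above explicit, here is a minimal pandas sketch (the 0.6 threshold is just the illustrative figure from the text, not a hard rule):
# Keep only the columns with at most 60% missing values
threshold = 0.6
mydata = mydata.loc[:, mydata.isna().mean() <= threshold]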
Time series methods
- Last Observation Carried Forward (LOCF) and Next Observation Carried Backward (NOCB)
This is a common statistical approach to the analysis of longitudinal repeated-measures data, where some follow-up observations may be missing. Longitudinal data follows the same sample at different points in time. Both of these methods can introduce bias into the analysis and give poor results when the data shows a visible trend.
- Linear interpolation
This method works well for a time series with some trend, but is not suitable for seasonal data.
- Seasonality + interpolation
This method works well for data with both trend and seasonality.
library(imputeTS)
na.random(mydata)                              # Random Imputation
na.locf(mydata, option = "locf")               # Last Obs. Carried Forward
na.locf(mydata, option = "nocb")               # Next Obs. Carried Backwards
na.interpolation(mydata)                       # Linear Interpolation
na.seadec(mydata, algorithm = "interpolation") # Seasonal Adjustment then Linear Interpolation
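For reference, here is a rough pandas equivalent of these fills, assuming mydata is a Series or DataFrame sorted by time (the seasonal-adjustment step of na.seadec has no direct one-line pandas counterpart):
# In Python (pandas)
locf = mydata.ffill()                          # Last Obs. Carried Forward
nocb = mydata.bfill()                          # Next Obs. Carried Backwards
interp = mydata.interpolate(method="linear")   # Linear Interpolation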
Imputation (conventional methods)
Mean, median and mode
Calculating the overall mean, median or mode is a very basic imputation method; it takes advantage neither of the characteristics of the time series nor of the relationships between variables. It is very fast, but has obvious drawbacks. One disadvantage is that mean imputation reduces the variance in the data set.
library(imputeTS)
na.mean(mydata, option = "mean")   # Mean Imputation
na.mean(mydata, option = "median") # Median Imputation
na.mean(mydata, option = "mode")   # Mode Imputation

# In Python
import numpy as np
from sklearn.impute import SimpleImputer
values = mydata.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"
Linear regression
To start, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation, with the variable containing missing data as the dependent variable. Cases with complete data for the predictor variables are used to fit the regression equation; the equation is then used to predict the missing values for the incomplete cases.
In an iterative process, the predicted values are inserted for the missing entries, and then all cases are used to re-fit the regression and predict the dependent variable again. These steps are repeated until the predicted values change little from one step to the next, that is, until they converge.
It “theoretically” provides good estimates for missing values. However, this model has several disadvantages that tend to outweigh the advantages. First, because the substituted values were predicted from the other variables, they tend to agree "too well", so the standard error is deflated. The method also assumes a linear relationship between the variables used in the regression equation, when there may not be one.
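Here is a minimal single-pass sketch of this idea in Python, assuming mydata is a pandas DataFrame whose column y has gaps and whose hypothetical predictors x and z are complete (the iterative version described above would simply repeat this step until the predictions stabilize):
from sklearn.linear_model import LinearRegression

predictors = ["x", "z"]              # hypothetical, fully observed predictors
missing = mydata["y"].isna()

# Fit the regression on complete cases only
model = LinearRegression().fit(mydata.loc[~missing, predictors],
                               mydata.loc[~missing, "y"])
# Predict the missing values of y from the fitted equation
mydata.loc[missing, "y"] = model.predict(mydata.loc[missing, predictors])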
Multiple imputation
- Imputation: impute the missing entries of the incomplete dataset m times (e.g. m = 3). Note that the imputed values are drawn from a distribution. Simply simulating draws does not account for uncertainty in the model parameters; a better approach is to use Markov Chain Monte Carlo (MCMC) simulation. This step results in m complete datasets.
- Analysis: Analyze each of the m completed datasets.
- Pooling: combine the m analysis results into a single final result.
# We will be using the mice library in R
library(mice)
# Deterministic regression imputation via mice
imp <- mice(mydata, method = "norm.predict", m = 1)
# Store data
data_imp <- complete(imp)
# Multiple Imputation
imp <- mice(mydata, m = 5)
# Build predictive model
fit <- with(data = imp, lm(y ~ x + z))
# Combine results of all 5 models
combine <- pool(fit)
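In Python, a comparable approach is scikit-learn's IterativeImputer, which is inspired by MICE; note that it is still marked experimental and, unlike mice, it returns a single completed dataset per call (setting sample_posterior=True and varying random_state is one way to generate several):
# In Python
from sklearn.experimental import enable_iterative_imputer  # required before the import below
from sklearn.impute import IterativeImputer

# sample_posterior=True draws imputed values from a predictive distribution
imputer = IterativeImputer(sample_posterior=True, random_state=0)
completed = imputer.fit_transform(mydata)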
This is by far the preferred imputation method for the following reasons:
- Easy to use
- No bias (if the imputation model is correct)
Categorical Data Imputation
- Mode imputation is one method, but it will certainly introduce bias.
- Missing values can be treated as a separate category in itself: we create another level for missing values and use it like any other level. This is the simplest method (see the sketch after this list).
- Prediction models: Here we create a predictive model to estimate the values that will replace the missing data. In this case, we split our data set into two sets: one set with no missing values for the variable (training) and another with missing values (test). We can use methods like logistic regression and ANOVA for prediction.
- Multiple imputation.
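As a small illustration of the first two options, assuming a pandas DataFrame mydata with a hypothetical categorical column city:
# Option 1: mode imputation (may introduce bias)
mode_filled = mydata["city"].fillna(mydata["city"].mode()[0])

# Option 2: treat missingness as its own category / level
category_filled = mydata["city"].fillna("Missing")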
With machine learning (KNN) - see other courses for more elaborate methods
There are other machine learning techniques such as XGBoost and Random Forest for data imputation, but we will discuss KNN as it is widely used. In this method, k neighbors are chosen based on a measure of distance and their average is used as an imputation estimate.
The method requires choosing the number of nearest neighbors and a distance metric. KNN can predict both discrete attributes (the most frequent value among the k nearest neighbors) and continuous attributes (the mean among the k nearest neighbors).
The distance metric varies by data type:
- Continuous data: Commonly used distance metrics for continuous data are Euclidean, Manhattan, and Cosine.
- Categorical data: the Hamming distance is usually used in this case. It takes all the categorical attributes and, for each one, counts one if the value differs between the two points. The Hamming distance is then equal to the number of attributes with differing values.
One of the most attractive features of the KNN algorithm is that it is simple to understand and easy to implement. The non-parametric nature of KNN gives it an advantage in certain contexts where the data may be highly "unusual".
One of the obvious disadvantages of the KNN algorithm is that it takes time when analyzing large datasets, as it searches for similar instances in the dataset.
Additionally, the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbor.
# In R
library(DMwR)
knnOutput <- knnImputation(mydata)

# In Python
from fancyimpute import KNN
# Use the 5 nearest rows which have a feature to fill in each row's missing features
knnOutput = KNN(k=5).fit_transform(mydata)  # older fancyimpute versions used .complete()
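If fancyimpute is not available, scikit-learn ships a comparable KNNImputer (a sketch; it uses a nan-aware Euclidean distance and averages the k nearest neighbours, so it is meant for numeric data):
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
knnOutput = imputer.fit_transform(mydata)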
Of all the methods described above, multiple imputation and KNN are widely used, and multiple imputation, being simpler, is generally preferred.