Handle missing values in Datasets
Popular strategies to handle missing values in a dataset using the sklearn module
We often come across datasets with some missing values. If a dataset has too many missing values, we might end up not using it at all. However, if there are only a few missing values, we can perform data imputation to fill them in.
Here are some of the ways we can handle missing values in a dataset:
1. Delete the rows
Possibly the simplest way to handle missing values is to delete the rows that contain them. Say we have 1000 rows and 5 of them have missing values: we can decide to work with the remaining 995 rows instead.
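As a minimal sketch of row deletion, we can use the pandas dropna method. The small DataFrame here is the same toy data used in the sections that follow, and the names complete_rows and mostly_complete are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as in the imputation examples in this post
data = pd.DataFrame([[1., 2., np.nan, 2.],
                     [5., np.nan, 1., 2.],
                     [4., np.nan, 3., np.nan],
                     [5., 6., 8., 1.],
                     [np.nan, 7., np.nan, 0.]])

# Drop every row that contains at least one missing value
complete_rows = data.dropna()
print(complete_rows)

# A softer variant: keep rows that have at least 3 observed values
mostly_complete = data.dropna(thresh=3)
print(mostly_complete)
```

On a dataset this small, strict deletion keeps only one row, which shows why this strategy is best suited to cases where just a small fraction of the rows is affected.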
2. Using the mean/median value
This method works only with numerical values: we fill in the missing values with the mean or median of the column.
from numpy import nan
from pandas import DataFrame
from sklearn.impute import SimpleImputer

data = DataFrame([[  1.,  2., nan,  2.],
                  [  5., nan,  1.,  2.],
                  [  4., nan,  3., nan],
                  [  5.,  6.,  8.,  1.],
                  [nan,   7., nan,  0.]])
print('{}\n'.format(repr(data)))

# SimpleImputer by default imputes using mean
imp_mean = SimpleImputer()
transformed = imp_mean.fit_transform(data)
print('{}\n'.format(repr(transformed)))

# Data imputation using median
imp_median = SimpleImputer(strategy='median')
transformed = imp_median.fit_transform(data)
print('{}\n'.format(repr(transformed)))
Output:
     0    1    2    3
0  1.0  2.0  NaN  2.0
1  5.0  NaN  1.0  2.0
2  4.0  NaN  3.0  NaN
3  5.0  6.0  8.0  1.0
4  NaN  7.0  NaN  0.0

array([[1.  , 2.  , 4.  , 2.  ],
       [5.  , 5.  , 1.  , 2.  ],
       [4.  , 5.  , 3.  , 1.25],
       [5.  , 6.  , 8.  , 1.  ],
       [3.75, 7.  , 4.  , 0.  ]])

array([[1. , 2. , 3. , 2. ],
       [5. , 6. , 1. , 2. ],
       [4. , 6. , 3. , 1.5],
       [5. , 6. , 8. , 1. ],
       [4.5, 7. , 3. , 0. ]])
3. Using most frequent value
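This method also works for categorical values. It replaces every missing value with the most frequent value in the column. When several values are tied for most frequent, as in columns 1 and 2 of our data, SimpleImputer uses the smallest of them.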
# Data imputation using most frequent value
from sklearn.impute import SimpleImputer
imp_frequent = SimpleImputer(strategy='most_frequent')
transformed = imp_frequent.fit_transform(data)
print('{}\n'.format(repr(transformed)))
Output:
array([[1., 2., 1., 2.],
       [5., 2., 1., 2.],
       [4., 2., 3., 2.],
       [5., 6., 8., 1.],
       [5., 7., 1., 0.]])
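Since the example above only uses numbers, here is a small sketch of the same strategy on a categorical column. The colour values are made up for illustration, and the column is passed as a NumPy object array so that strings and nan can sit side by side.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries
colors = np.array([['red'],
                   ['blue'],
                   [np.nan],
                   ['red'],
                   [np.nan]], dtype=object)

# 'most_frequent' works on string data, unlike 'mean' or 'median'
imp_frequent = SimpleImputer(strategy='most_frequent')
filled = imp_frequent.fit_transform(colors)
print(filled)
```

Here 'red' appears twice and 'blue' once, so both missing entries are filled with 'red'.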
4. Fill in with constant
We can also choose a fixed value ourselves and use it to fill in all the missing values.
# Data imputation using constant
from sklearn.impute import SimpleImputer
imp_constant = SimpleImputer(strategy='constant', fill_value=-1)
transformed = imp_constant.fit_transform(data)
print('{}\n'.format(repr(transformed)))
Output:
array([[ 1.,  2., -1.,  2.],
       [ 5., -1.,  1.,  2.],
       [ 4., -1.,  3., -1.],
       [ 5.,  6.,  8.,  1.],
       [-1.,  7., -1.,  0.]])
5. Prediction of missing values
A regression or classification model (depending on the type of data) can be used to predict the missing values. We train the model on the rows where the value is available and then use it to predict the value for the rows where it is missing.
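A minimal sketch of this idea, assuming a numeric target column and using sklearn's LinearRegression as one possible model. The DataFrame and its column names (x1, x2, y) are hypothetical; the toy y values were built as 2*x1 + x2 + 1 so that the predictions are easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: x1 and x2 are complete, y has two missing values
df = pd.DataFrame({
    'x1': [1., 2., 3., 4., 5., 6.],
    'x2': [2., 1., 4., 3., 6., 5.],
    'y':  [5., 6., np.nan, 12., np.nan, 18.],
})

features = ['x1', 'x2']
missing = df['y'].isna()

# Train only on the rows where y is known
model = LinearRegression()
model.fit(df.loc[~missing, features], df.loc[~missing, 'y'])

# Predict y for the rows where it is missing and fill it in
df.loc[missing, 'y'] = model.predict(df.loc[missing, features])
print(df)
```

For a categorical target we would swap in a classifier instead. Note that this sketch assumes the feature columns themselves have no missing values; if they do, they need to be handled first.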
6. Imputation using KNN
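This method uses the k-nearest neighbours algorithm to predict missing values based on feature similarity. Each missing value is filled in with the mean of that feature over the k most similar rows that have it.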
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
transformed = imputer.fit_transform(data)
print('{}\n'.format(repr(transformed)))
Output:
array([[1. , 2. , 2. , 2. ],
       [5. , 4.5, 1. , 2. ],
       [4. , 4. , 3. , 2. ],
       [5. , 6. , 8. , 1. ],
       [5. , 7. , 4.5, 0. ]])
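To see where one of these numbers comes from, we can redo a single imputation by hand. This is a simplified reconstruction, assuming KNNImputer's default settings (NaN-aware Euclidean distance and uniform weights); the helper names target_row, target_col, donors and nearest are only illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

data = np.array([[1., 2., np.nan, 2.],
                 [5., np.nan, 1., 2.],
                 [4., np.nan, 3., np.nan],
                 [5., 6., 8., 1.],
                 [np.nan, 7., np.nan, 0.]])

# Pairwise distances that skip missing coordinates and rescale the rest
dist = nan_euclidean_distances(data)

# We want to fill row 0, column 2
target_row, target_col = 0, 2

# Only rows that actually have a value in column 2 can act as donors
donors = [i for i in range(len(data))
          if i != target_row and not np.isnan(data[i, target_col])]

# Pick the 2 closest donors and average their values in column 2
nearest = sorted(donors, key=lambda i: dist[target_row, i])[:2]
imputed = data[nearest, target_col].mean()

print(nearest, imputed)
```

Rows 1 and 2 are the closest donors, and the mean of their column 2 values (1 and 3) is 2, which matches the first row of the KNNImputer output above.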
Conclusion
Apart from the methods discussed here, there are other approaches to data imputation, such as MICE (Multivariate Imputation by Chained Equations), deep learning (the datawig module), interpolation and extrapolation, hot-deck imputation, etc.
There is no perfect way to compensate for missing values, so we need to choose our strategy based on our dataset and the nature of the missing values.
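For readers who want to try a chained-equations style approach without leaving sklearn, here is a rough sketch using IterativeImputer. This is an assumption about what is convenient to use, not a full MICE implementation: the class is still marked experimental in sklearn, so it needs an explicit enabling import, and it returns a single imputation rather than multiple ones.

```python
import numpy as np
# IterativeImputer is experimental, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = np.array([[1., 2., np.nan, 2.],
                 [5., np.nan, 1., 2.],
                 [4., np.nan, 3., np.nan],
                 [5., 6., 8., 1.],
                 [np.nan, 7., np.nan, 0.]])

# Each feature with missing values is modelled from the other features,
# and the process is repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
transformed = imputer.fit_transform(data)
print(transformed)
```

With only five rows the imputed values are not very meaningful, so this is just meant to show the shape of the API.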
That's it for now.
Happy coding! Cheers 🍻