Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

We call this a sentinel value: when present, it indicates a missing (or null) value.

The isna method gives us a Boolean Series with True where values are null:

data.isna().apply(lambda x: sum(x))
id            0
zip           1
phone        27
lat          30
lng          30
type          0
score        15
risk         17
violation    17
dtype: int64

In pandas, missing data is also refered to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python None value is also treated as NA.

Method Description
dropna Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna Fill in missing data with some value
isna Return Boolean values indicating which values are missing/NA.
notna Negation of isna, returns True for non-NA values and False for NA values.

isna and notna

The isna and notna functions should be used to detect missing values. These functions are available as top-level methods in the pandas namespace, as well as instance methods on Series and DataFrame objects.

import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])

data.isna()
0    False
1     True
2    False
3     True
dtype: bool
data.notna()
0     True
1    False
2     True
3    False
dtype: bool

These functions can also be used with DataFrame objects:

data = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, 8, None]])
data
     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   6.0
2  7.0  8.0   NaN
data.isna()
       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True
data.notna()
       0      1      2
0   True   True   True
1   True  False   True
2   True   True  False

dropna

The dropna method can be used to drop rows or columns with missing data. By default, it drops any row containing a missing value:

data.dropna()
     0    1    2
0  1.0  2.0  3.0

To drop columns with missing data instead of rows, use the axis keyword:

data.dropna(axis='columns')
     0
0  1.0
1  4.0
2  7.0

By default, dropna drops rows or columns that contain any missing data. You can change this to drop rows or columns that contain all missing data by passing the how keyword:

data[3] = np.nan
data
     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN
data.dropna(how='all')
     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN
data.dropna(axis='columns', how='all')
     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   6.0
2  7.0  8.0   NaN

Dropping rows or columns with missing data is a common operation when cleaning up data for analysis. However, it is important to consider whether this is appropriate for your data, as it can introduce bias if the missing data is not random.

You can also specify a threshold for the minimum number of non-missing values required to keep a row or column using the thresh keyword:

data
     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN
data.dropna(thresh=3)
     0    1    2   3
0  1.0  2.0  3.0 NaN
1  4.0  NaN  6.0 NaN
data.dropna(axis='columns', thresh=3)
     0
0  1.0
1  4.0
2  7.0

fillna

The fillna method can be used to fill in missing data with a specified value or method. For example, to fill in missing values with zero:

data.fillna(0)
     0    1    2    3
0  1.0  2.0  3.0  0.0
1  4.0  0.0  6.0  0.0
2  7.0  8.0  0.0  0.0

You can also use different fill values for different columns by passing a dictionary:

data.fillna({1: 0.5, 2: 99})
     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  0.5   6.0 NaN
2  7.0  8.0  99.0 NaN

The method keyword can be used to specify a method for filling in missing values. The two most common methods are ffill (forward fill) and bfill (backward fill).

Forward fill propagates the last valid observation forward to the next valid. Backward fill uses the next valid observation to fill the gap.

data.fillna(method='ffill')
     0    1    2    3
0  1.0  2.0  3.0  NaN
1  4.0  2.0  6.0  NaN
2  7.0  8.0  6.0  NaN
data.fillna(method='bfill')
     0    1    2   3
0  1.0  2.0  3.0 NaN
1  4.0  8.0  6.0 NaN
2  7.0  8.0  NaN NaN