apply(lambda x: sum(x)) data.isna().
id 0
zip 1
phone 27
lat 30
lng 30
type 0
score 15
risk 17
violation 17
dtype: int64
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64
dtype, pandas uses the floating-point value NaN
(Not a Number) to represent missing data.
We call this a sentinel value: when present, it indicates a missing (or null) value.
The isna method gives us a Boolean Series with True
where values are null:
id 0
zip 1
phone 27
lat 30
lng 30
type 0
score 15
risk 17
violation 17
dtype: int64
In pandas, missing data is also refered to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
The built-in Python None
value is also treated as NA.
Method | Description |
---|---|
dropna |
Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna |
Fill in missing data with some value |
isna |
Return Boolean values indicating which values are missing/NA. |
notna |
Negation of isna , returns True for non-NA values and False for NA values. |
isna
and notna
The isna
and notna
functions should be used to detect missing values. These functions are available as top-level methods in the pandas namespace, as well as instance methods on Series and DataFrame objects.
0 False
1 True
2 False
3 True
dtype: bool
0 True
1 False
2 True
3 False
dtype: bool
These functions can also be used with DataFrame objects:
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 6.0
2 7.0 8.0 NaN
0 1 2
0 False False False
1 False True False
2 False False True
0 1 2
0 True True True
1 True False True
2 True True False
dropna
The dropna
method can be used to drop rows or columns with missing data. By default, it drops any row containing a missing value:
0 1 2
0 1.0 2.0 3.0
To drop columns with missing data instead of rows, use the axis
keyword:
0
0 1.0
1 4.0
2 7.0
By default, dropna
drops rows or columns that contain any missing data. You can change this to drop rows or columns that contain all missing data by passing the how
keyword:
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 6.0
2 7.0 8.0 NaN
Dropping rows or columns with missing data is a common operation when cleaning up data for analysis. However, it is important to consider whether this is appropriate for your data, as it can introduce bias if the missing data is not random.
You can also specify a threshold for the minimum number of non-missing values required to keep a row or column using the thresh
keyword:
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
0
0 1.0
1 4.0
2 7.0
fillna
The fillna
method can be used to fill in missing data with a specified value or method. For example, to fill in missing values with zero:
0 1 2 3
0 1.0 2.0 3.0 0.0
1 4.0 0.0 6.0 0.0
2 7.0 8.0 0.0 0.0
You can also use different fill values for different columns by passing a dictionary:
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 0.5 6.0 NaN
2 7.0 8.0 99.0 NaN
The method
keyword can be used to specify a method for filling in missing values. The two most common methods are ffill
(forward fill) and bfill
(backward fill).
Forward fill propagates the last valid observation forward to the next valid. Backward fill uses the next valid observation to fill the gap.
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 2.0 6.0 NaN
2 7.0 8.0 6.0 NaN
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 8.0 6.0 NaN
2 7.0 8.0 NaN NaN