Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
We call this a sentinel value: when present, it indicates a missing (or null) value.
The isna method gives us a Boolean Series with True where values are null:
import pandas as pd data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)data.head()
id
zip
phone
lat
lng
type
score
risk
violation
0
70064
94103.0
1.415565e+10
NaN
NaN
Routine - Unscheduled
75.0
High Risk
Improper reheating of food
1
90039
94103.0
NaN
NaN
NaN
Routine - Unscheduled
81.0
High Risk
High risk food holding temperature
2
89059
94115.0
1.415369e+10
NaN
NaN
Complaint
NaN
NaN
NaN
3
91044
94112.0
NaN
NaN
NaN
Routine - Unscheduled
84.0
Moderate Risk
Inadequate and inaccessible handwashing facili...
4
62768
94122.0
NaN
37.765421
-122.477256
Routine - Unscheduled
90.0
Low Risk
Food safety certificate or food handler card n...
data.isna().apply(lambda x: sum(x))
id 0
zip 1
phone 27
lat 30
lng 30
type 0
score 15
risk 17
violation 17
dtype: int64
In pandas, missing data is also refered to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
The built-in Python None value is also treated as NA.
Method
Description
dropna
Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna
Fill in missing data with some value
isna
Return Boolean values indicating which values are missing/NA.
notna
Negation of isna, returns True for non-NA values and False for NA values.
isna and notna
The isna and notna functions should be used to detect missing values. These functions are available as top-level methods in the pandas namespace, as well as instance methods on Series and DataFrame objects.
import pandas as pdimport numpy as npdata = pd.Series([1, np.nan, 'hello', None])data.isna()
0 False
1 True
2 False
3 True
dtype: bool
data.notna()
0 True
1 False
2 True
3 False
dtype: bool
These functions can also be used with DataFrame objects:
The dropna method can be used to drop rows or columns with missing data. By default, it drops any row containing a missing value:
data.dropna()
0 1 2
0 1.0 2.0 3.0
To drop columns with missing data instead of rows, use the axis keyword:
data.dropna(axis='columns')
0
0 1.0
1 4.0
2 7.0
By default, dropna drops rows or columns that contain any missing data. You can change this to drop rows or columns that contain all missing data by passing the how keyword:
data[3] = np.nandata
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
data.dropna(how='all')
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
data.dropna(axis='columns', how='all')
0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 6.0
2 7.0 8.0 NaN
Dropping rows or columns with missing data is a common operation when cleaning up data for analysis. However, it is important to consider whether this is appropriate for your data, as it can introduce bias if the missing data is not random.
You can also specify a threshold for the minimum number of non-missing values required to keep a row or column using the thresh keyword:
data
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
2 7.0 8.0 NaN NaN
data.dropna(thresh=3)
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 NaN 6.0 NaN
data.dropna(axis='columns', thresh=3)
0
0 1.0
1 4.0
2 7.0
fillna
The fillna method can be used to fill in missing data with a specified value or method. For example, to fill in missing values with zero:
You can also use different fill values for different columns by passing a dictionary:
data.fillna({1: 0.5, 2: 99})
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 0.5 6.0 NaN
2 7.0 8.0 99.0 NaN
The method keyword can be used to specify a method for filling in missing values. The two most common methods are ffill (forward fill) and bfill (backward fill).
Forward fill propagates the last valid observation forward to the next valid. Backward fill uses the next valid observation to fill the gap.
data.fillna(method='ffill')
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 2.0 6.0 NaN
2 7.0 8.0 6.0 NaN
data.fillna(method='bfill')
0 1 2 3
0 1.0 2.0 3.0 NaN
1 4.0 8.0 6.0 NaN
2 7.0 8.0 NaN NaN