Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

We call this a sentinel value: when present, it indicates a missing (or null) value.

The isna method gives us a Boolean Series with True where values are null:

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()

	id	zip	phone	lat	lng	type	score	risk	violation
0	70064	94103.0	1.415565e+10	NaN	NaN	Routine - Unscheduled	75.0	High Risk	Improper reheating of food
1	90039	94103.0	NaN	NaN	NaN	Routine - Unscheduled	81.0	High Risk	High risk food holding temperature
2	89059	94115.0	1.415369e+10	NaN	NaN	Complaint	NaN	NaN	NaN
3	91044	94112.0	NaN	NaN	NaN	Routine - Unscheduled	84.0	Moderate Risk	Inadequate and inaccessible handwashing facili...
4	62768	94122.0	NaN	37.765421	-122.477256	Routine - Unscheduled	90.0	Low Risk	Food safety certificate or food handler card n...

data.isna().apply(lambda x: sum(x))

id            0
zip           1
phone        27
lat          30
lng          30
type          0
score        15
risk         17
violation    17
dtype: int64

In pandas, missing data is also refered to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python None value is also treated as NA.

Method	Description
`dropna`	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna`	Fill in missing data with some value
`isna`	Return Boolean values indicating which values are missing/NA.
`notna`	Negation of `isna`, returns `True` for non-NA values and `False` for NA values.

`isna` and `notna`

The isna and notna functions should be used to detect missing values. These functions are available as top-level methods in the pandas namespace, as well as instance methods on Series and DataFrame objects.

import pandas as pd
import numpy as np

data = pd.Series([1, np.nan, 'hello', None])

data.isna()

0    False
1     True
2    False
3     True
dtype: bool

data.notna()

0     True
1    False
2     True
3    False
dtype: bool

These functions can also be used with DataFrame objects:

data = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, 8, None]])
data

     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   6.0
2  7.0  8.0   NaN

data.isna()

       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True

data.notna()

       0      1      2
0   True   True   True
1   True  False   True
2   True   True  False

`dropna`

The dropna method can be used to drop rows or columns with missing data. By default, it drops any row containing a missing value:

data.dropna()

     0    1    2
0  1.0  2.0  3.0

To drop columns with missing data instead of rows, use the axis keyword:

data.dropna(axis='columns')

By default, dropna drops rows or columns that contain any missing data. You can change this to drop rows or columns that contain all missing data by passing the how keyword:

data[3] = np.nan
data

     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN

data.dropna(how='all')

     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN

data.dropna(axis='columns', how='all')

     0    1     2
0  1.0  2.0   3.0
1  4.0  NaN   6.0
2  7.0  8.0   NaN

Dropping rows or columns with missing data is a common operation when cleaning up data for analysis. However, it is important to consider whether this is appropriate for your data, as it can introduce bias if the missing data is not random.

You can also specify a threshold for the minimum number of non-missing values required to keep a row or column using the thresh keyword:

data

     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  NaN   6.0 NaN
2  7.0  8.0   NaN NaN

data.dropna(thresh=3)

     0    1    2   3
0  1.0  2.0  3.0 NaN
1  4.0  NaN  6.0 NaN

data.dropna(axis='columns', thresh=3)

`fillna`

The fillna method can be used to fill in missing data with a specified value or method. For example, to fill in missing values with zero:

data.fillna(0)

     0    1    2    3
0  1.0  2.0  3.0  0.0
1  4.0  0.0  6.0  0.0
2  7.0  8.0  0.0  0.0

You can also use different fill values for different columns by passing a dictionary:

data.fillna({1: 0.5, 2: 99})

     0    1     2   3
0  1.0  2.0   3.0 NaN
1  4.0  0.5   6.0 NaN
2  7.0  8.0  99.0 NaN

The method keyword can be used to specify a method for filling in missing values. The two most common methods are ffill (forward fill) and bfill (backward fill).

Forward fill propagates the last valid observation forward to the next valid. Backward fill uses the next valid observation to fill the gap.

data.fillna(method='ffill')

     0    1    2    3
0  1.0  2.0  3.0  NaN
1  4.0  2.0  6.0  NaN
2  7.0  8.0  6.0  NaN

data.fillna(method='bfill')

     0    1    2   3
0  1.0  2.0  3.0 NaN
1  4.0  8.0  6.0 NaN
2  7.0  8.0  NaN NaN

Missing Data

isna and notna

dropna

fillna

`isna` and `notna`

`dropna`

`fillna`