Tabular data (pd.DataFrame), as discussed previously, is made up of observations (rows) and features (columns). The data types of features (df.dtypes) fall into two primary categories: numeric and categorical.
There is also a third, special category called missing. Missing data is special because it is not really a data type at all: it is a placeholder for a value that is not known or not applicable. In pandas, missing data is represented by NaN (Not a Number). More on missing data in a bit.
To study these feature types, we will use the dataset of food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health.
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()
      id      zip         phone        lat          lng  type                   score  risk           violation
0  70064  94103.0  1.415565e+10        NaN          NaN  Routine - Unscheduled   75.0  High Risk      Improper reheating of food
1  90039  94103.0           NaN        NaN          NaN  Routine - Unscheduled   81.0  High Risk      High risk food holding temperature
2  89059  94115.0  1.415369e+10        NaN          NaN  Complaint                NaN  NaN            NaN
3  91044  94112.0           NaN        NaN          NaN  Routine - Unscheduled   84.0  Moderate Risk  Inadequate and inaccessible handwashing facili...
4  62768  94122.0           NaN  37.765421  -122.477256  Routine - Unscheduled   90.0  Low Risk       Food safety certificate or food handler card n...
Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.
The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.
We call this a sentinel value: when present, it indicates a missing (or null) value.
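A minimal sketch of how the NaN sentinel behaves, using a small hypothetical Series of scores rather than the restaurant data. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the middle value is missing.
scores = pd.Series([75.0, np.nan, 90.0])

# NaN is itself a float, which is why it fits in a float64 column.
print(scores.dtype)  # float64

# Descriptive statistics skip NaN by default (skipna=True).
mean_default = scores.mean()              # averages 75.0 and 90.0 only
mean_strict = scores.mean(skipna=False)   # NaN propagates into the result

# NaN never compares equal to itself, so test for it with isna(), not ==.
nan_equals_itself = np.nan == np.nan
print(mean_default, mean_strict, nan_equals_itself)
```

Passing skipna=False is a quick way to find out whether a statistic was computed over incomplete data.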
The isna method returns a Boolean DataFrame with True wherever a value is null. Summing each column then counts the missing values per feature:
data.isna().sum()
id 0
zip 1
phone 27
lat 30
lng 30
type 0
score 15
risk 17
violation 17
dtype: int64
In pandas, missing data is also referred to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing values.
The built-in Python None value is also treated as NA.
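As a small illustration, assuming two hypothetical Series (one numeric, one text), None is flagged as missing in both:

```python
import pandas as pd

# Hypothetical values; None marks the missing entry in each Series.
numeric = pd.Series([1.5, None, 3.0])
text = pd.Series(["High Risk", None, "Low Risk"])

# In the numeric Series, None is stored as NaN, so the dtype stays float64.
print(numeric.dtype)

# isna() reports the None entry as missing in both cases.
numeric_missing = numeric.isna().tolist()
text_missing = text.isna().tolist()
print(numeric_missing, text_missing)
```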
Method  Description
dropna  Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna  Fill in missing data with some value.
isna    Return Boolean values indicating which values are missing/NA.
notna   Negation of isna; returns True for non-NA values and False for NA values.
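The four methods in the table can be sketched on a tiny hypothetical DataFrame. The column names mirror the restaurant data, but the values and the "Unknown" fill label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, row 2 is partly missing.
df = pd.DataFrame({
    "score": [75.0, np.nan, 90.0],
    "risk": ["High Risk", None, None],
})

mask = df.isna()        # True where a value is missing
present = df.notna()    # element-wise negation of isna

# Keep only rows with no missing values at all.
complete_rows = df.dropna()

# Keep rows that have at least one non-missing value.
mostly_complete = df.dropna(thresh=1)

# Fill each column with its own replacement value.
filled = df.fillna({"score": df["score"].mean(), "risk": "Unknown"})
print(filled)
```

Whether to drop or fill depends on the analysis: dropping loses observations, while filling introduces values that were never observed.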
Numerical Features
Numeric data is data that can be represented as numbers. These variables generally describe some numeric quantity or amount and are also sometimes referred to as “quantitative” variables.
Since numerical features are already represented as numbers, they can be used in machine learning models directly, with no need to encode them.
In the example above, numerical features include zip, phone, lat, lng, score.
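One way to list the numeric columns programmatically is select_dtypes. The sketch below uses a small hypothetical frame with a few of the same column names, not the full data set.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample with a subset of the restaurant columns.
sample = pd.DataFrame({
    "zip": [94103.0, 94115.0],
    "lat": [np.nan, 37.765421],
    "score": [75.0, 90.0],
    "risk": ["High Risk", "Low Risk"],
})

# "number" matches every integer and floating-point dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)
```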
Discrete Features
Discrete data is data that is counted. For example, the number of students in a class is discrete: you can count whole students, but a count can never include a fraction of a student.
In the restaurant inspection data set, zip, phone, and score are discrete features. Note that zip and phone are stored as numbers but act as identifiers rather than counts, so arithmetic on them is rarely meaningful.
data[['zip', 'phone', 'score']].head()
     zip  phone  score
0  94105    NaN   82.0
1  94109    NaN    NaN
2  94115    NaN   82.0
3  94115    NaN   78.0
4  94110    NaN   94.0
Continuous Features
Continuous data is data that is measured. For example, the height of a student is continuous: a measurement can land on any value within a range, including fractions of a unit, such as 5 feet 6.5 inches.
In the restaurant inspection data set, lat and lng are continuous features.
data[['lat', 'lng']].head()
         lat          lng
0  37.787925  -122.400953
1  37.786108  -122.425764
2  37.791607  -122.434563
3  37.788932  -122.433895
4  37.739161  -122.416967
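A rough heuristic for telling discrete from continuous columns is to check whether every non-missing value is a whole number. The helper name looks_discrete and the sample values are hypothetical, and the check is only a hint: a measured quantity can happen to contain whole numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one counted and one measured column.
sample = pd.DataFrame({
    "score": [82.0, np.nan, 78.0],
    "lat": [37.787925, 37.786108, np.nan],
})

def looks_discrete(column: pd.Series) -> bool:
    """Heuristic: True if every non-missing value is a whole number."""
    values = column.dropna()
    return bool((values % 1 == 0).all())

score_is_discrete = looks_discrete(sample["score"])
lat_is_discrete = looks_discrete(sample["lat"])
print(score_is_discrete, lat_is_discrete)
```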
Categorical Features
Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some characteristic or quality of a data unit, and are also sometimes referred to as “qualitative” variables.
data[['type', 'risk', 'violation']].head()
   type                   risk       violation
0  Routine - Unscheduled  High Risk  High risk food holding temperature
1  Complaint              NaN        NaN
2  Routine - Unscheduled  Low Risk   Inadequate warewashing facilities or equipment
3  Routine - Unscheduled  Low Risk   Improper food storage
4  Routine - Unscheduled  Low Risk   Unapproved or unmaintained equipment or utensils
Ordinal Features
Ordinal data is categorical data whose values have a natural order. For example, t-shirt sizes are ordinal: they run from smallest to largest, so any two sizes can be compared, even though the gaps between them are not measured. In the restaurant inspection data set, risk is an ordinal feature.
data[['risk']].head()
   risk
0  High Risk
1  NaN
2  Low Risk
3  Low Risk
4  Low Risk
Encoding Ordinal Features
Ordinal features can be encoded using a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL).
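Applied to the risk column, label encoding might look like the sketch below. The Series is a hypothetical sample, and the integer codes (0 for Low Risk up to 2 for High Risk) are an assumed ordering chosen for illustration.

```python
import pandas as pd

# Hypothetical sample of the risk column, including one missing value.
risk = pd.Series(["High Risk", None, "Low Risk", "Moderate Risk", "Low Risk"])

# Assumed ordering from lowest to highest risk.
risk_order = {"Low Risk": 0, "Moderate Risk": 1, "High Risk": 2}

# map() looks each label up in the dictionary; missing labels become NaN.
risk_encoded = risk.map(risk_order)
print(risk_encoded)
```

Because the missing entry stays NaN, the encoded column is stored as float64 rather than int64.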
Nominal Features
Nominal data is categorical data with no inherent order. For example, the color of a car is nominal: colors are not more or less than each other, just different. In the restaurant inspection data set, type and violation are nominal features.
data[['type', 'violation']].head()
   type                   violation
0  Routine - Unscheduled  High risk food holding temperature
1  Complaint              NaN
2  Routine - Unscheduled  Inadequate warewashing facilities or equipment
3  Routine - Unscheduled  Improper food storage
4  Routine - Unscheduled  Unapproved or unmaintained equipment or utensils
Encoding Nominal Features
Since nominal features don’t have any order, encoding them requires some creativity. The most common way to encode nominal features is to use a technique called one-hot encoding. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value.
The figure below illustrates one-hot encoding for a “day” (of the week) column:
pandas has a built-in get_dummies function (pd.get_dummies) for doing this:
pd.get_dummies(data['type']).head()
   Complaint  New Construction  New Ownership  Reinspection/Followup  Routine - Unscheduled
0          0                 0              0                      0                      1
1          0                 0              0                      0                      1
2          1                 0              0                      0                      0
3          0                 0              0                      0                      1
4          0                 0              0                      0                      1
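The same idea on a hypothetical “day” column, as a minimal sketch. The day labels are made up, and dtype=int is passed explicitly because recent pandas versions return True/False columns by default.

```python
import pandas as pd

# Hypothetical nominal column with three distinct days.
days = pd.DataFrame({"day": ["Mon", "Tue", "Mon", "Wed"]})

# One new column per unique value; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(days["day"], dtype=int)
print(one_hot)

# Each row has exactly one 1, marking that observation's day.
row_sums = one_hot.sum(axis=1).tolist()
```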
.dtypes attribute
.dtypes is an attribute of a DataFrame that returns the data type of each column. The data types are returned as a Series with the column names as the index labels.
data.dtypes
id int64
zip float64
phone float64
lat float64
lng float64
type object
score float64
risk object
violation object
dtype: object
In pandas, object is the data type used for string columns, while int64 and float64 are used for integer and floating-point columns, respectively.
.astype()
The .astype() method casts a pandas object to a specified dtype.
data.head()
      id      zip         phone        lat          lng  type                   score  risk           violation
0  70064  94103.0  1.415565e+10        NaN          NaN  Routine - Unscheduled   75.0  High Risk      Improper reheating of food
1  90039  94103.0           NaN        NaN          NaN  Routine - Unscheduled   81.0  High Risk      High risk food holding temperature
2  89059  94115.0  1.415369e+10        NaN          NaN  Complaint                NaN  NaN            NaN
3  91044  94112.0           NaN        NaN          NaN  Routine - Unscheduled   84.0  Moderate Risk  Inadequate and inaccessible handwashing facili...
4  62768  94122.0           NaN  37.765421  -122.477256  Routine - Unscheduled   90.0  Low Risk       Food safety certificate or food handler card n...
data['zip'].astype(int)
---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[7], line 1
----> 1 data['zip'].astype(int)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py:5912, in NDFrame.astype(self, dtype, copy, errors)
   5905     results = [
   5906         self.iloc[:, i].astype(dtype, copy=copy)
   5907         for i in range(len(self.columns))
   5908     ]
   5910 else:
   5911     # else, only a single dtype is given
-> 5912     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5913     return self._constructor(new_data).__finalize__(self, method="astype")
   5915 # GH 33113: handle empty frame or series

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:419, in BaseBlockManager.astype(self, dtype, copy, errors)
    418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 419     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:304, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    302         applied = b.apply(f, **kwargs)
    303     else:
--> 304         applied = getattr(b, f)(**kwargs)
    305 except (TypeError, NotImplementedError):
    306     if not ignore_failures:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/blocks.py:580, in Block.astype(self, dtype, copy, errors)
    562 """
    563 Coerce to the new dtype.
    564
   (...)
    576 Block
    577 """
    578 values = self.values
--> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    582 new_values = maybe_coerce_values(new_values)
    583 newb = self.make_block(new_values)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1292, in astype_array_safe(values, dtype, copy, errors)
   1289     dtype = dtype.numpy_dtype
   1291 try:
-> 1292     new_values = astype_array(values, dtype, copy=copy)
   1293 except (ValueError, TypeError):
   1294     # e.g. astype_nansafe can fail on object-dtype of strings
   1295     #  trying to convert to float
   1296     if errors == "ignore":

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1237, in astype_array(values, dtype, copy)
   1234     values = values.astype(dtype, copy=copy)
   1236 else:
-> 1237     values = astype_nansafe(values, dtype, copy=copy)
   1239 # in pandas we don't store numpy str dtypes, so convert to object
   1240 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1148, in astype_nansafe(arr, dtype, copy, skipna)
   1145     raise TypeError(f"cannot astype a timedelta from [{arr.dtype}] to [{dtype}]")
   1147 elif np.issubdtype(arr.dtype, np.floating) and np.issubdtype(dtype, np.integer):
-> 1148     return astype_float_to_int_nansafe(arr, dtype, copy)
   1150 elif is_object_dtype(arr.dtype):
   1151
   1152     # work around NumPy brokenness, #1987
   1153     if np.issubdtype(dtype.type, np.integer):

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1193, in astype_float_to_int_nansafe(values, dtype, copy)
   1189 """
   1190 astype with a check preventing converting NaN to an meaningless integer value.
   1191 """
   1192 if not np.isfinite(values).all():
-> 1193     raise IntCastingNaNError(
   1194         "Cannot convert non-finite values (NA or inf) to integer"
   1195     )
   1196 return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
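The cast fails because the zip column contains a missing value, and NaN has no integer representation. Three possible ways around this are sketched below on a hypothetical Series. Which one is appropriate depends on the analysis, and the -1 placeholder is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical zip codes with one missing value, stored as float64.
zips = pd.Series([94103.0, np.nan, 94115.0], name="zip")

# Option 1: pandas' nullable integer dtype (note the capital "I")
# keeps the missing entry as <NA> instead of raising.
zips_nullable = zips.astype("Int64")

# Option 2: drop the missing rows first, then cast to a plain integer.
zips_dropped = zips.dropna().astype(int)

# Option 3: replace missing values with a placeholder, then cast.
zips_filled = zips.fillna(-1).astype(int)

print(zips_nullable.tolist(), zips_dropped.tolist(), zips_filled.tolist())
```

The nullable Int64 dtype is usually the least destructive choice, since it neither discards rows nor invents a value.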