Feature Types

Tabular data (pd.DataFrame), as discussed previously, is made up of observations (rows) and features (columns). The data types of features (df.dtypes) fall into two primary categories: numeric and categorical.

There is also a third, special category called missing. Missing data is special because it is not a data type at all: it is a placeholder for a value that is unknown or not applicable. pandas represents missing data as NaN (Not a Number). More on missing data in a bit.

To study these feature types, we will use the dataset of food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health.

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()

	id	zip	phone	lat	lng	type	score	risk	violation
0	70064	94103.0	1.415565e+10	NaN	NaN	Routine - Unscheduled	75.0	High Risk	Improper reheating of food
1	90039	94103.0	NaN	NaN	NaN	Routine - Unscheduled	81.0	High Risk	High risk food holding temperature
2	89059	94115.0	1.415369e+10	NaN	NaN	Complaint	NaN	NaN	NaN
3	91044	94112.0	NaN	NaN	NaN	Routine - Unscheduled	84.0	Moderate Risk	Inadequate and inaccessible handwashing facili...
4	62768	94122.0	NaN	37.765421	-122.477256	Routine - Unscheduled	90.0	Low Risk	Food safety certificate or food handler card n...

Missing Data

Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

We call this a sentinel value: when present, it indicates a missing (or null) value.
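As a minimal sketch of how the NaN sentinel behaves, the snippet below builds a small float64 Series with one gap. The values and variable names are made up for illustration and do not come from the restaurant file.

```python
import numpy as np
import pandas as pd

# Toy float64 Series with one missing entry (illustrative values only)
scores = pd.Series([75.0, np.nan, 90.0])

# NaN is itself a float, so the column keeps the float64 dtype
print(scores.dtype)

# NaN never compares equal to anything, not even itself,
# so missing values are detected with isna() rather than ==
nan_equals_itself = np.nan == np.nan
missing_mask = scores.isna()

# Descriptive statistics skip NaN by default; skipna=False propagates it
mean_default = scores.mean()
mean_strict = scores.mean(skipna=False)
print(mean_default, mean_strict)
```

The default mean is computed over the two known values only, while the strict version returns NaN because one input is unknown.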

The isna method returns a Boolean DataFrame with True wherever a value is null. Summing each column then counts its missing values:

data.isna().sum()

id            0
zip           1
phone        27
lat          30
lng          30
type          0
score        15
risk         17
violation    17
dtype: int64

In pandas, missing data is also referred to as NA, which stands for Not Available. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing values.

The built-in Python None value is also treated as NA.

Method	Description
`dropna`	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
`fillna`	Fill in missing data with some value
`isna`	Return Boolean values indicating which values are missing/NA.
`notna`	Negation of `isna`, returns `True` for non-NA values and `False` for NA values.
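The four methods in the table can be tried on a toy DataFrame. This is only a sketch: the frame, its values, and the fill choices (column mean for score, the label "Unknown" for risk) are assumptions made for illustration, not a recommendation for the restaurant data.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; None is treated as NA just like NaN
toy = pd.DataFrame({
    "score": [75.0, np.nan, 90.0],
    "risk": ["High Risk", None, "Low Risk"],
})

# isna / notna: Boolean masks marking missing and present values
na_mask = toy.isna()
not_na_mask = toy.notna()

# dropna: keep only rows without any missing value
complete_rows = toy.dropna()

# fillna: replace gaps, here with a different value per column
filled = toy.fillna({"score": toy["score"].mean(), "risk": "Unknown"})
print(filled)
```

Whether to drop or fill depends on why the values are missing, which is one reason to inspect the missing data before removing it.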

Numerical Features

Numeric data is data that can be represented as numbers. These variables generally describe some numeric quantity or amount and are also sometimes referred to as “quantitative” variables.

Since numerical features are already represented as numbers, they can be used in machine learning models directly, with no encoding step.

In the example above, the numerical features are zip, phone, lat, lng, and score.

data[['zip', 'phone', 'lat', 'lng', 'score']].head()

	zip	phone	lat	lng	score
0	94105	NaN	37.787925	-122.400953	82.0
1	94109	NaN	37.786108	-122.425764	NaN
2	94115	NaN	37.791607	-122.434563	82.0
3	94115	NaN	37.788932	-122.433895	78.0
4	94110	NaN	37.739161	-122.416967	94.0
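Rather than listing numeric columns by hand, one option is to let pandas pick them by dtype with select_dtypes. The sketch below uses a small stand-in frame with made-up rows; only the column names mirror the restaurant data.

```python
import pandas as pd

# Stand-in frame with two numeric columns and one text column (illustrative rows)
toy = pd.DataFrame({
    "zip": [94103.0, 94115.0],
    "score": [75.0, 81.0],
    "risk": ["High Risk", "Low Risk"],
})

# include="number" keeps integer and float columns and drops everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Note that this selects by storage type only, so a column of identifiers stored as numbers would still be included.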

Discrete Features

Discrete data is data that is counted. For example, the number of students in a class is discrete: you can count 25 or 26 students, but never 25.5, because only whole students can be counted.

In the restaurant inspection data set, zip, phone, and score are discrete features. Note that zip and phone are numeric only in how they are stored: they act as identifiers, so arithmetic on them (an average zip code, say) is not meaningful, and in practice they are often treated as categorical.

data[['zip', 'phone', 'score']].head()

	zip	phone	score
0	94105	NaN	82.0
1	94109	NaN	NaN
2	94115	NaN	82.0
3	94115	NaN	78.0
4	94110	NaN	94.0

Continuous Features

Continuous data is data that is measured. For example, the height of a student is continuous: a measurement can fall anywhere on a scale, including fractions of a unit, such as 5 feet and 6.5 inches.

In the restaurant inspection data set, lat and lng are continuous features.

data[['lat', 'lng']].head()

	lat	lng
0	37.787925	-122.400953
1	37.786108	-122.425764
2	37.791607	-122.434563
3	37.788932	-122.433895
4	37.739161	-122.416967
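pandas stores both kinds of feature as float64 once a column has missing values, so the dtype alone does not separate discrete from continuous. One rough heuristic, sketched below, is to check whether every non-missing value is a whole number. The helper name looks_discrete is hypothetical, and the check is only a hint: a measured quantity that happens to be recorded in whole units would also pass.

```python
import numpy as np
import pandas as pd

def looks_discrete(col: pd.Series) -> bool:
    """Return True if every non-missing value in col is a whole number."""
    values = col.dropna()
    return bool((values % 1 == 0).all())

# Toy columns modeled on the restaurant data (illustrative values)
score = pd.Series([82.0, np.nan, 78.0])
lat = pd.Series([37.787925, 37.786108, np.nan])

print(looks_discrete(score), looks_discrete(lat))
```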

Categorical Features

Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some characteristic or quality of a data unit, and are also sometimes referred to as “qualitative” variables.

data[['type', 'risk', 'violation']].head()

	type	risk	violation
0	Routine - Unscheduled	High Risk	High risk food holding temperature
1	Complaint	NaN	NaN
2	Routine - Unscheduled	Low Risk	Inadequate warewashing facilities or equipment
3	Routine - Unscheduled	Low Risk	Improper food storage
4	Routine - Unscheduled	Low Risk	Unapproved or unmaintained equipment or utensils

Ordinal Features

Ordinal data is categorical data whose values have a natural order. For example, t-shirt sizes are ordinal: they run from smallest to largest, so any two sizes can be compared. In the restaurant inspection data set, risk is an ordinal feature (Low Risk < Moderate Risk < High Risk).

data[['risk']].head()

	risk
0	High Risk
1	NaN
2	Low Risk
3	Low Risk
4	Low Risk

Encoding Ordinal Features

Ordinal features can be encoded using a technique called label encoding. Label encoding simply converts each value in a column to a number, chosen here so that the numbers preserve the order of the categories. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL).

# Map each risk level to a number that preserves its order; missing values stay NaN
data['risk_enc'] = data['risk'].map({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})
data.head()

	id	zip	phone	lat	lng	type	score	risk	violation	risk_enc
0	64454	94105	NaN	37.787925	-122.400953	Routine - Unscheduled	82.0	High Risk	High risk food holding temperature	2.0
1	33014	94109	NaN	37.786108	-122.425764	Complaint	NaN	NaN	NaN	NaN
2	1526	94115	NaN	37.791607	-122.434563	Routine - Unscheduled	82.0	Low Risk	Inadequate warewashing facilities or equipment	0.0
3	73	94115	NaN	37.788932	-122.433895	Routine - Unscheduled	78.0	Low Risk	Improper food storage	0.0
4	66402	94110	NaN	37.739161	-122.416967	Routine - Unscheduled	94.0	Low Risk	Unapproved or unmaintained equipment or utensils	0.0
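An alternative to a hand-written mapping is an ordered categorical dtype, which records the order on the column itself. The sketch below assumes the three risk levels seen in this data set and uses a toy Series rather than the real column.

```python
import numpy as np
import pandas as pd

# Toy risk column with one missing value (illustrative)
risk = pd.Series(["High Risk", np.nan, "Low Risk", "Moderate Risk"])

# Declare the categories from lowest to highest and mark them as ordered
levels = ["Low Risk", "Moderate Risk", "High Risk"]
risk_cat = risk.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Integer codes follow the declared order; missing values get the code -1
codes = risk_cat.cat.codes

# Ordered categories support comparisons; missing values compare as False
is_above_low = risk_cat > "Low Risk"
print(codes.tolist(), is_above_low.tolist())
```

Unlike the mapping approach, the codes here use -1 rather than NaN for missing values, so they need handling before being fed to a model.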

Nominal Features

Nominal data is categorical data with no inherent order. For example, the color of a car is nominal: colors are neither more nor less than one another, just different.

data[['type', 'violation']].head()

	type	violation
0	Routine - Unscheduled	High risk food holding temperature
1	Complaint	NaN
2	Routine - Unscheduled	Inadequate warewashing facilities or equipment
3	Routine - Unscheduled	Improper food storage
4	Routine - Unscheduled	Unapproved or unmaintained equipment or utensils

Encoding Nominal Features

Since nominal features don’t have any order, encoding them requires some creativity. The most common way to encode nominal features is to use a technique called one-hot encoding. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value.

The figure below illustrates one-hot encoding for a “day” (of the week) column:

pandas has a built-in function, pd.get_dummies, for doing this:

pd.get_dummies(data['type']).head()

	Complaint	Routine - Unscheduled
0	0	1
1	0	1
2	1	0
3	0	1
4	0	1
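To make the “day” example concrete, here is a small sketch with a made-up day column. Passing dtype=int is a choice made here to get 0/1 columns; recent pandas versions return Boolean columns by default.

```python
import pandas as pd

# Toy nominal feature: day of the week (illustrative values)
days = pd.Series(["Mon", "Tue", "Mon", "Wed"], name="day")

# One new column per unique value; dtype=int yields 0/1 instead of True/False
one_hot = pd.get_dummies(days, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column matching that row's original value.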

`.dtypes` attribute

.dtypes is an attribute of a DataFrame that returns the data type of each column. The data types are returned as a Series with the column names as the index labels.

data.dtypes

id             int64
zip          float64
phone        float64
lat          float64
lng          float64
type          object
score        float64
risk          object
violation     object
dtype: object

In pandas, object is the data type used for string columns, while int64 and float64 are used for integer and floating-point columns, respectively.
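The dtypes above also show a side effect of missing data: zip holds whole numbers yet is stored as float64. A short sketch with toy values shows why. Because NaN is a float, an integer column with a gap is inferred as float64 by default.

```python
import pandas as pd

# No gaps: whole numbers are inferred as int64
complete = pd.Series([94103, 94115])

# One gap: None becomes NaN, which forces the column to float64
with_gap = pd.Series([94103, None])

print(complete.dtype, with_gap.dtype)
```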

`.astype()`

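The .astype() method casts a pandas object to a specified dtype. In the table below, zip is stored as float64 (94103.0) even though zip codes are whole numbers, so casting it to int is a natural thing to try: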

data.head()

	id	zip	phone	lat	lng	type	score	risk	violation
0	70064	94103.0	1.415565e+10	NaN	NaN	Routine - Unscheduled	75.0	High Risk	Improper reheating of food
1	90039	94103.0	NaN	NaN	NaN	Routine - Unscheduled	81.0	High Risk	High risk food holding temperature
2	89059	94115.0	1.415369e+10	NaN	NaN	Complaint	NaN	NaN	NaN
3	91044	94112.0	NaN	NaN	NaN	Routine - Unscheduled	84.0	Moderate Risk	Inadequate and inaccessible handwashing facili...
4	62768	94122.0	NaN	37.765421	-122.477256	Routine - Unscheduled	90.0	Low Risk	Food safety certificate or food handler card n...

data['zip'].astype(int)

---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[7], line 1
----> 1 data['zip'].astype(int)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py:5912, in NDFrame.astype(self, dtype, copy, errors)
   5905     results = [
   5906         self.iloc[:, i].astype(dtype, copy=copy)
   5907         for i in range(len(self.columns))
   5908     ]
   5910 else:
   5911     # else, only a single dtype is given
-> 5912     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   5913     return self._constructor(new_data).__finalize__(self, method="astype")
   5915 # GH 33113: handle empty frame or series

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:419, in BaseBlockManager.astype(self, dtype, copy, errors)
    418 def astype(self: T, dtype, copy: bool = False, errors: str = "raise") -> T:
--> 419     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py:304, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    302         applied = b.apply(f, **kwargs)
    303     else:
--> 304         applied = getattr(b, f)(**kwargs)
    305 except (TypeError, NotImplementedError):
    306     if not ignore_failures:

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/blocks.py:580, in Block.astype(self, dtype, copy, errors)
    562 """
    563 Coerce to the new dtype.
    564 
   (...)
    576 Block
    577 """
    578 values = self.values
--> 580 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    582 new_values = maybe_coerce_values(new_values)
    583 newb = self.make_block(new_values)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1292, in astype_array_safe(values, dtype, copy, errors)
   1289     dtype = dtype.numpy_dtype
   1291 try:
-> 1292     new_values = astype_array(values, dtype, copy=copy)
   1293 except (ValueError, TypeError):
   1294     # e.g. astype_nansafe can fail on object-dtype of strings
   1295     #  trying to convert to float
   1296     if errors == "ignore":

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1237, in astype_array(values, dtype, copy)
   1234     values = values.astype(dtype, copy=copy)
   1236 else:
-> 1237     values = astype_nansafe(values, dtype, copy=copy)
   1239 # in pandas we don't store numpy str dtypes, so convert to object
   1240 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1148, in astype_nansafe(arr, dtype, copy, skipna)
   1145     raise TypeError(f"cannot astype a timedelta from [{arr.dtype}] to [{dtype}]")
   1147 elif np.issubdtype(arr.dtype, np.floating) and np.issubdtype(dtype, np.integer):
-> 1148     return astype_float_to_int_nansafe(arr, dtype, copy)
   1150 elif is_object_dtype(arr.dtype):
   1151 
   1152     # work around NumPy brokenness, #1987
   1153     if np.issubdtype(dtype.type, np.integer):

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1193, in astype_float_to_int_nansafe(values, dtype, copy)
   1189 """
   1190 astype with a check preventing converting NaN to an meaningless integer value.
   1191 """
   1192 if not np.isfinite(values).all():
-> 1193     raise IntCastingNaNError(
   1194         "Cannot convert non-finite values (NA or inf) to integer"
   1195     )
   1196 return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
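The cast fails because the zip column contains a missing value, and NaN has no integer representation. The sketch below shows three possible ways around this on a toy Series. Which one is appropriate depends on the analysis, and the -1 placeholder in the last option is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

# Toy zip column with one missing value (illustrative)
zip_codes = pd.Series([94103.0, np.nan, 94115.0], name="zip")

# Option 1: the nullable integer dtype "Int64" (capital I) keeps the gap as <NA>
zip_nullable = zip_codes.astype("Int64")

# Option 2: drop the missing rows first, then cast to a plain integer dtype
zip_dropped = zip_codes.dropna().astype(int)

# Option 3: fill the gap with a placeholder value, then cast
zip_filled = zip_codes.fillna(-1).astype(int)

print(zip_nullable)
```

The first option is the only one that preserves both the row count and the fact that the value is missing.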