Feature Types

Tabular data (pd.DataFrame), as discussed previously, is made up of observations (rows) and features (columns). Data type (df.dtypes) of features fall into two primary categories: numeric and categorical.

There also exists a third special category of data type called missing. Missing data is a special data type because it is not a data type at all. It is a placeholder for a value that is not known or not applicable. Missing data is represented by NaN (not a number) in pandas. More on missing data in a bit.

To study these feature types, we will use the dataset of food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health.

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()
id zip phone lat lng type score risk violation
0 70064 94103.0 1.415565e+10 NaN NaN Routine - Unscheduled 75.0 High Risk Improper reheating of food
1 90039 94103.0 NaN NaN NaN Routine - Unscheduled 81.0 High Risk High risk food holding temperature
2 89059 94115.0 1.415369e+10 NaN NaN Complaint NaN NaN NaN
3 91044 94112.0 NaN NaN NaN Routine - Unscheduled 84.0 Moderate Risk Inadequate and inaccessible handwashing facili...
4 62768 94122.0 NaN 37.765421 -122.477256 Routine - Unscheduled 90.0 Low Risk Food safety certificate or food handler card n...

Numerical Features

Numeric data is data that can be represented as numbers. These variables generally describe some numeric quantity or amount and are also sometimes referred to as “quantitative” variables.

Since numerical features are already represented as numbers, they are already ready to be used in machine learning models and there is no need to encode them.

In the example above, numerical features include zip, phone, lat, lng, score.

data[['zip', 'phone', 'lat', 'lng', 'score']].head()
zip phone lat lng score
0 94105 NaN 37.787925 -122.400953 82.0
1 94109 NaN 37.786108 -122.425764 NaN
2 94115 NaN 37.791607 -122.434563 82.0
3 94115 NaN 37.788932 -122.433895 78.0
4 94110 NaN 37.739161 -122.416967 94.0

Discrete Features

Discrete data is data that is counted. For example, the number of students in a class is discrete data. You can count the number of students in a class. You can not count the number of students in a class and get a fraction of a student. You can only count whole students.

In the restaurants inspection data set, zip, phone, score are discrete features.

data[['zip', 'phone', 'score']].head()
zip phone score
0 94105 NaN 82.0
1 94109 NaN NaN
2 94115 NaN 82.0
3 94115 NaN 78.0
4 94110 NaN 94.0

Continuous Features

Continuous data is data that is measured. For example, the height of a student is continuous data. You can measure the height of a student. You can measure the height of a student and get a fraction of a student. You can measure a student and get a height of 5 feet and 6.5 inches.

In the restaurants inspection data set, lat, lng are continuous features.

data[['lat', 'lng']].head()
lat lng
0 37.787925 -122.400953
1 37.786108 -122.425764
2 37.791607 -122.434563
3 37.788932 -122.433895
4 37.739161 -122.416967

Categorical Features

Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some characteristic or quality of a data unit, and are also sometimes referred to as “qualitative” variables.

data[['type', 'risk', 'violation']].head()
type risk violation
0 Routine - Unscheduled High Risk High risk food holding temperature
1 Complaint NaN NaN
2 Routine - Unscheduled Low Risk Inadequate warewashing facilities or equipment
3 Routine - Unscheduled Low Risk Improper food storage
4 Routine - Unscheduled Low Risk Unapproved or unmaintained equipment or utensils

Ordinal Features

Ordinal data is data that is ordered in some way. For example, the size of a t-shirt is ordinal data. The sizes are ordered from smallest to largest. The sizes are more or less than each other. They are different and ordered.

data[['risk']].head()
risk
0 High Risk
1 NaN
2 Low Risk
3 Low Risk
4 Low Risk

Nominal Features

Nominal data is data that is not ordered in any way. For example, the color of a car is nominal data. There is no order to the colors. The colors are not more or less than each other. They are just different.

data[['type', 'violation']].head()
type violation
0 Routine - Unscheduled High risk food holding temperature
1 Complaint NaN
2 Routine - Unscheduled Inadequate warewashing facilities or equipment
3 Routine - Unscheduled Improper food storage
4 Routine - Unscheduled Unapproved or unmaintained equipment or utensils

.dtypes attribute

.dtypes is an attribute of a DataFrame that returns the data type of each column. The data types are returned as a Series with the column names as the index labels.

data.dtypes
id             int64
zip          float64
phone        float64
lat          float64
lng          float64
type          object
score        float64
risk          object
violation     object
dtype: object

In pandas, object is the data type used for string columns, while int64 and float64 are used for integer and floating-point columns, respectively.

.astype()

Cast a pandas object to a specified dtype

data.head()
id zip phone lat lng type score risk violation
0 70064 94103.0 1.415565e+10 NaN NaN Routine - Unscheduled 75.0 High Risk Improper reheating of food
1 90039 94103.0 NaN NaN NaN Routine - Unscheduled 81.0 High Risk High risk food holding temperature
2 89059 94115.0 1.415369e+10 NaN NaN Complaint NaN NaN NaN
3 91044 94112.0 NaN NaN NaN Routine - Unscheduled 84.0 Moderate Risk Inadequate and inaccessible handwashing facili...
4 62768 94122.0 NaN 37.765421 -122.477256 Routine - Unscheduled 90.0 Low Risk Food safety certificate or food handler card n...
data['zip'].astype(int)