Label and One-Hot Encoding

In machine learning and data preprocessing, categorical variables often need to be converted into a numerical format that algorithms can understand. Two common techniques for this conversion are Label Encoding and One-Hot Encoding.

Encoding Ordinal Features: Label Encoding

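Ordinal features can be encoded using label encoding, which maps each distinct value in a column to an integer chosen so that the integers follow the natural order of the categories. For example, t-shirt sizes could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL). The restaurant inspection data loaded below has an ordinal risk column (Low, Moderate, High Risk), which is encoded the same way.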

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()
      id      zip         phone        lat          lng  type                   score  risk           violation
0  70064  94103.0  1.415565e+10        NaN          NaN  Routine - Unscheduled   75.0  High Risk      Improper reheating of food
1  90039  94103.0           NaN        NaN          NaN  Routine - Unscheduled   81.0  High Risk      High risk food holding temperature
2  89059  94115.0  1.415369e+10        NaN          NaN  Complaint                NaN  NaN            NaN
3  91044  94112.0           NaN        NaN          NaN  Routine - Unscheduled   84.0  Moderate Risk  Inadequate and inaccessible handwashing facili...
4  62768  94122.0           NaN  37.765421  -122.477256  Routine - Unscheduled   90.0  Low Risk       Food safety certificate or food handler card n...
# Map each risk level to an integer that preserves its order; missing values stay NaN
data['risk_enc'] = data['risk'].map({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})
data.head()
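
As a self-contained sketch of the same idea, the t-shirt sizes from the start of this section can be encoded with an ordered categorical instead of a hand-written dictionary. The `sizes` data and the variable names here are hypothetical and chosen only for illustration; the point is that the order is stated once, and pandas derives the integer codes from it.

```python
import pandas as pd

# Hypothetical toy data: t-shirt sizes, including one missing value.
sizes = pd.Series(["M", "XS", "XXL", "S", None, "L"])

# Spell out the order explicitly; pandas would otherwise sort the labels alphabetically.
size_order = ["XS", "S", "M", "L", "XL", "XXL"]
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives each value's position in size_order; missing values become -1.
size_enc = pd.Series(size_cat.codes, index=sizes.index)
print(size_enc.tolist())
```

Note that this approach marks missing values with -1 rather than NaN, so they may need separate handling before modeling.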

Encoding Nominal Features: One-Hot Encoding

Since nominal features don’t have any order, assigning them integers as in label encoding would suggest a ranking that does not exist. The most common way to encode them instead is a technique called one-hot encoding. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value.
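
To make the mechanics concrete, here is a minimal sketch that builds the binary columns by hand for a hypothetical day-of-week column. The toy `days` frame and the variable names are assumptions made for illustration, not part of the restaurant data.

```python
import pandas as pd

# Hypothetical toy data: a nominal "day" column.
days = pd.DataFrame({"day": ["Mon", "Tue", "Mon", "Wed"]})

# One new 0/1 column per unique value: 1 where the row has that value, else 0.
one_hot = pd.DataFrame(
    {value: (days["day"] == value).astype(int) for value in days["day"].unique()}
)
print(one_hot)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.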

The figure below illustrates one-hot encoding for a “day” (of the week) column:


pandas has a built-in pd.get_dummies function for doing this:

# dtype=int gives 0/1 columns; pandas 2.0 and later return True/False by default
pd.get_dummies(data['type'], dtype=int).head()
   Complaint  New Construction  New Ownership  Reinspection/Followup  Routine - Unscheduled
0          0                 0              0                      0                      1
1          0                 0              0                      0                      1
2          1                 0              0                      0                      0
3          0                 0              0                      0                      1
4          0                 0              0                      0                      1
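
In practice the indicator columns usually need to be attached back to the rest of the table. The following is one possible sketch, using a small hypothetical stand-in (`toy`) for the inspection data so that it runs without a download. The `prefix` argument is optional and is used here only to keep the new column names recognizable.

```python
import pandas as pd

# Hypothetical toy stand-in for the inspection data's 'type' column.
toy = pd.DataFrame({
    "score": [75.0, 81.0, 90.0],
    "type": ["Routine - Unscheduled", "Complaint", "Routine - Unscheduled"],
})

# prefix labels the new columns; dtype=int requests 0/1 instead of booleans.
type_dummies = pd.get_dummies(toy["type"], prefix="type", dtype=int)

# Attach the indicator columns and drop the original text column.
encoded = pd.concat([toy.drop(columns="type"), type_dummies], axis=1)
print(encoded)
```

The result keeps `score` and replaces `type` with one indicator column per inspection type.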