Label and One-Hot Encoding

In machine learning and data preprocessing, categorical variables often need to be converted into a numerical format that algorithms can understand. Two common techniques for this conversion are Label Encoding and One-Hot Encoding.

Encoding Ordinal Features: Label Encoding

Ordinal features can be encoded using a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL).

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()

	id	zip	phone	lat	lng	type	score	risk	violation
0	70064	94103.0	1.415565e+10	NaN	NaN	Routine - Unscheduled	75.0	High Risk	Improper reheating of food
1	90039	94103.0	NaN	NaN	NaN	Routine - Unscheduled	81.0	High Risk	High risk food holding temperature
2	89059	94115.0	1.415369e+10	NaN	NaN	Complaint	NaN	NaN	NaN
3	91044	94112.0	NaN	NaN	NaN	Routine - Unscheduled	84.0	Moderate Risk	Inadequate and inaccessible handwashing facili...
4	62768	94122.0	NaN	37.765421	-122.477256	Routine - Unscheduled	90.0	Low Risk	Food safety certificate or food handler card n...

data['risk_enc'] = data['risk'].replace({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})
data.head()

Encoding Nominal Features: One-Hot Encoding

Since nominal features don’t have any order, encoding them requires some creativity. The most common way to encode nominal features is to use a technique called one-hot encoding. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value.

The figure below illustrates how one-hot encoding for a “day” (of the week) column:

pandas has a built-in .get_dummies function for doing this:

pd.get_dummies(data['type']).head()

	Complaint	Routine - Unscheduled
0	0	1
1	0	1
2	1	0
3	0	1
4	0	1