In machine learning and data preprocessing, categorical variables often need to be converted into a numerical format that algorithms can understand. Two common techniques for this conversion are Label Encoding and One-Hot Encoding.
Encoding Ordinal Features: Label Encoding
Ordinal features can be encoded using a technique called label encoding. Label encoding maps each distinct value in a column to an integer; for ordinal features, the integers should follow the natural order of the categories. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL).
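The t-shirt example above can be sketched in pandas as follows. This is a minimal illustration, not the only way to do it: the `shirts` DataFrame and its `size` column are made-up data, and the explicit `size_order` list is an assumption that simply mirrors the ordering given in the text.

```python
import pandas as pd

# Assumed category order, taken from the t-shirt example in the text.
size_order = ["XS", "S", "M", "L", "XL", "XXL"]

# Build the label mapping: XS -> 0, S -> 1, ..., XXL -> 5.
size_to_code = {size: code for code, size in enumerate(size_order)}

# Hypothetical data used only for illustration.
shirts = pd.DataFrame({"size": ["M", "XS", "XXL", "L", "S"]})

# Replace each size with its integer code, preserving the ordinal order.
shirts["size_encoded"] = shirts["size"].map(size_to_code)
print(shirts)
```

Writing the mapping out explicitly keeps the order under your control. Encoders that assign integers alphabetically would put L before M, which loses the ordinal meaning.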
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)
data.head()
Since nominal features don’t have any order, encoding them requires some creativity. The most common way to encode nominal features is to use a technique called one-hot encoding. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value.
The figure below illustrates how one-hot encoding works for a “day” (of the week) column:
pandas has a built-in get_dummies function (called as pd.get_dummies) for doing this:
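As a minimal sketch of that call, the snippet below one-hot encodes a small, made-up `day` column. The `days` DataFrame and its values are hypothetical and are not taken from the restaurants dataset; `dtype=int` is used here only so the indicator columns show 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical data used only for illustration.
days = pd.DataFrame({"day": ["Mon", "Tue", "Mon", "Sun"]})

# Create one binary indicator column per unique value of "day".
# New columns are named "<original column>_<value>", e.g. "day_Mon".
one_hot = pd.get_dummies(days, columns=["day"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the new columns, marking which day the original observation had.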