{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tabular data (`pd.DataFrame`), as discussed previously, is made up of observations (rows) and features (columns). Data type (`df.dtypes`) of features fall into two primary categories: **numeric** and **categorical**.\n",
    "\n",
    "There also exists a third special category of data type called **missing**. Missing data is a special data type because it is not a data type at all. It is a placeholder for a value that is not known or not applicable. Missing data is represented by `NaN` (not a number) in pandas. More on missing data in a bit.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/featuretypes.png\n",
    "---\n",
    "width: 100%\n",
    "name: directive-fig\n",
    "---\n",
    "Classification of feature types\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To study these feature types, we will use the **dataset of food safety scores** for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>zip</th>\n",
       "      <th>phone</th>\n",
       "      <th>lat</th>\n",
       "      <th>lng</th>\n",
       "      <th>type</th>\n",
       "      <th>score</th>\n",
       "      <th>risk</th>\n",
       "      <th>violation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>70064</td>\n",
       "      <td>94103.0</td>\n",
       "      <td>1.415565e+10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>75.0</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>Improper reheating of food</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>90039</td>\n",
       "      <td>94103.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>81.0</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>High risk food holding temperature</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>89059</td>\n",
       "      <td>94115.0</td>\n",
       "      <td>1.415369e+10</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Complaint</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>91044</td>\n",
       "      <td>94112.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>84.0</td>\n",
       "      <td>Moderate Risk</td>\n",
       "      <td>Inadequate and inaccessible handwashing facili...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>62768</td>\n",
       "      <td>94122.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.765421</td>\n",
       "      <td>-122.477256</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>90.0</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Food safety certificate or food handler card n...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id      zip         phone        lat         lng                   type  \\\n",
       "0  70064  94103.0  1.415565e+10        NaN         NaN  Routine - Unscheduled   \n",
       "1  90039  94103.0           NaN        NaN         NaN  Routine - Unscheduled   \n",
       "2  89059  94115.0  1.415369e+10        NaN         NaN              Complaint   \n",
       "3  91044  94112.0           NaN        NaN         NaN  Routine - Unscheduled   \n",
       "4  62768  94122.0           NaN  37.765421 -122.477256  Routine - Unscheduled   \n",
       "\n",
       "   score           risk                                          violation  \n",
       "0   75.0      High Risk                         Improper reheating of food  \n",
       "1   81.0      High Risk                 High risk food holding temperature  \n",
       "2    NaN            NaN                                                NaN  \n",
       "3   84.0  Moderate Risk  Inadequate and inaccessible handwashing facili...  \n",
       "4   90.0       Low Risk  Food safety certificate or food handler card n...  "
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "\n",
    "data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Missing Data\n",
    "\n",
    "Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.\n",
    "\n",
    "The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. For data with `float64` dtype, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data.\n",
    "\n",
    "We call this a _sentinel value_: when present, it indicates a missing (or null) value.\n",
    "\n",
    "The isna method gives us a Boolean Series with `True` where values are null:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id            0\n",
       "zip           1\n",
       "phone        27\n",
       "lat          30\n",
       "lng          30\n",
       "type          0\n",
       "score        15\n",
       "risk         17\n",
       "violation    17\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "data.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas, missing data is also refered to as NA, which stands for _Not Available_. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.\n",
    "\n",
    "The built-in Python `None` value is also treated as NA.\n",
    "\n",
    "|    Method   |   Description   |\n",
    "| :------------ | -------------: |\n",
    "|        `dropna`   | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.       |\n",
    "|     `fillna`     |      Fill in missing data with some value    |\n",
    "| `isna`\t| Return Boolean values indicating which values are missing/NA. |\n",
    "| `notna`\t| Negation of `isna`, returns `True` for non-NA values and `False` for NA values. |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numerical Features\n",
    "\n",
    "Numeric data is data that can be represented as numbers. These variables generally describe some numeric _quantity_ or _amount_ and are also sometimes referred to as \"quantitative\" variables. \n",
    "\n",
    "Since numerical features are already represented as numbers, they are already ready to be used in machine learning models and there is no need to encode them.\n",
    "\n",
    "In the example above, numerical features include `zip`, `phone`, `lat`, `lng`, `score`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>zip</th>\n",
       "      <th>phone</th>\n",
       "      <th>lat</th>\n",
       "      <th>lng</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>94105</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.787925</td>\n",
       "      <td>-122.400953</td>\n",
       "      <td>82.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>94109</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.786108</td>\n",
       "      <td>-122.425764</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.791607</td>\n",
       "      <td>-122.434563</td>\n",
       "      <td>82.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.788932</td>\n",
       "      <td>-122.433895</td>\n",
       "      <td>78.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>94110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.739161</td>\n",
       "      <td>-122.416967</td>\n",
       "      <td>94.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     zip  phone        lat         lng  score\n",
       "0  94105    NaN  37.787925 -122.400953   82.0\n",
       "1  94109    NaN  37.786108 -122.425764    NaN\n",
       "2  94115    NaN  37.791607 -122.434563   82.0\n",
       "3  94115    NaN  37.788932 -122.433895   78.0\n",
       "4  94110    NaN  37.739161 -122.416967   94.0"
      ]
     },
     "execution_count": 99,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['zip', 'phone', 'lat', 'lng', 'score']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discrete Features\n",
    "\n",
    "Discrete data is data that is counted. For example, the number of students in a class is discrete data. You can count the number of students in a class. You can not count the number of students in a class and get a fraction of a student. You can only count whole students.\n",
    "\n",
    "In the restaurants inspection data set, `zip`, `phone`, `score` are discrete features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>zip</th>\n",
       "      <th>phone</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>94105</td>\n",
       "      <td>NaN</td>\n",
       "      <td>82.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>94109</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>82.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>78.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>94110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>94.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     zip  phone  score\n",
       "0  94105    NaN   82.0\n",
       "1  94109    NaN    NaN\n",
       "2  94115    NaN   82.0\n",
       "3  94115    NaN   78.0\n",
       "4  94110    NaN   94.0"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['zip', 'phone', 'score']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Continuous Features\n",
    "\n",
    "Continuous data is data that is measured. For example, the height of a student is continuous data. You can measure the height of a student. You can measure the height of a student and get a fraction of a student. You can measure a student and get a height of 5 feet and 6.5 inches.\n",
    "\n",
    "In the restaurants inspection data set, `lat`, `lng` are continuous features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lat</th>\n",
       "      <th>lng</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>37.787925</td>\n",
       "      <td>-122.400953</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>37.786108</td>\n",
       "      <td>-122.425764</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37.791607</td>\n",
       "      <td>-122.434563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>37.788932</td>\n",
       "      <td>-122.433895</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>37.739161</td>\n",
       "      <td>-122.416967</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         lat         lng\n",
       "0  37.787925 -122.400953\n",
       "1  37.786108 -122.425764\n",
       "2  37.791607 -122.434563\n",
       "3  37.788932 -122.433895\n",
       "4  37.739161 -122.416967"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['lat', 'lng']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical Features\n",
    "\n",
    "Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some _characteristic_ or _quality_ of a data unit, and are also sometimes referred to as \"qualitative\" variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>risk</th>\n",
       "      <th>violation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>High risk food holding temperature</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Complaint</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Inadequate warewashing facilities or equipment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Improper food storage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Unapproved or unmaintained equipment or utensils</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    type       risk  \\\n",
       "0  Routine - Unscheduled  High Risk   \n",
       "1              Complaint        NaN   \n",
       "2  Routine - Unscheduled   Low Risk   \n",
       "3  Routine - Unscheduled   Low Risk   \n",
       "4  Routine - Unscheduled   Low Risk   \n",
       "\n",
       "                                          violation  \n",
       "0                High risk food holding temperature  \n",
       "1                                               NaN  \n",
       "2    Inadequate warewashing facilities or equipment  \n",
       "3                             Improper food storage  \n",
       "4  Unapproved or unmaintained equipment or utensils  "
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['type', 'risk', 'violation']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ordinal Features\n",
    "\n",
    "Ordinal data is data that is ordered in some way. For example, the size of a t-shirt is ordinal data. The sizes are ordered from smallest to largest. The sizes are more or less than each other. They are different and ordered.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>risk</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>High Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        risk\n",
       "0  High Risk\n",
       "1        NaN\n",
       "2   Low Risk\n",
       "3   Low Risk\n",
       "4   Low Risk"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[['risk']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Encoding Ordinal Features\n",
    "\n",
    "Ordinal features can be encoded using a technique called **label encoding**. Label encoding is simply converting each value in a column to a number. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>zip</th>\n",
       "      <th>phone</th>\n",
       "      <th>lat</th>\n",
       "      <th>lng</th>\n",
       "      <th>type</th>\n",
       "      <th>score</th>\n",
       "      <th>risk</th>\n",
       "      <th>violation</th>\n",
       "      <th>risk_enc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>64454</td>\n",
       "      <td>94105</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.787925</td>\n",
       "      <td>-122.400953</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>82.0</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>High risk food holding temperature</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>33014</td>\n",
       "      <td>94109</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.786108</td>\n",
       "      <td>-122.425764</td>\n",
       "      <td>Complaint</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1526</td>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.791607</td>\n",
       "      <td>-122.434563</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>82.0</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Inadequate warewashing facilities or equipment</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>73</td>\n",
       "      <td>94115</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.788932</td>\n",
       "      <td>-122.433895</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>78.0</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Improper food storage</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>66402</td>\n",
       "      <td>94110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.739161</td>\n",
       "      <td>-122.416967</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>94.0</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>Unapproved or unmaintained equipment or utensils</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id    zip  phone        lat         lng                   type  score  \\\n",
       "0  64454  94105    NaN  37.787925 -122.400953  Routine - Unscheduled   82.0   \n",
       "1  33014  94109    NaN  37.786108 -122.425764              Complaint    NaN   \n",
       "2   1526  94115    NaN  37.791607 -122.434563  Routine - Unscheduled   82.0   \n",
       "3     73  94115    NaN  37.788932 -122.433895  Routine - Unscheduled   78.0   \n",
       "4  66402  94110    NaN  37.739161 -122.416967  Routine - Unscheduled   94.0   \n",
       "\n",
       "        risk                                         violation  risk_enc  \n",
       "0  High Risk                High risk food holding temperature       2.0  \n",
       "1        NaN                                               NaN       NaN  \n",
       "2   Low Risk    Inadequate warewashing facilities or equipment       0.0  \n",
       "3   Low Risk                             Improper food storage       0.0  \n",
       "4   Low Risk  Unapproved or unmaintained equipment or utensils       0.0  "
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data['risk_enc'] = data['risk'].replace({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nominal Features\n",
    "\n",
    "Nominal data is data that is not ordered in any way. For example, the color of a car is nominal data. There is no order to the colors. The colors are not more or less than each other. They are just different."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>type</th>\n",
       "      <th>violation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>High risk food holding temperature</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Complaint</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Inadequate warewashing facilities or equipment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Improper food storage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>Unapproved or unmaintained equipment or utensils</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    type                                         violation\n",
       "0  Routine - Unscheduled                High risk food holding temperature\n",
       "1              Complaint                                               NaN\n",
       "2  Routine - Unscheduled    Inadequate warewashing facilities or equipment\n",
       "3  Routine - Unscheduled                             Improper food storage\n",
       "4  Routine - Unscheduled  Unapproved or unmaintained equipment or utensils"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "data[['type', 'violation']].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Encoding Nominal Features\n",
    "\n",
    "Since nominal features don't have any order, encoding them requires some creativity. The most common way to encode nominal features is to use a technique called **one-hot encoding**. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value. \n",
    "\n",
    "The figure below illustrates how one-hot encoding for a \"day\" (of the week) column: \n",
    "\n",
    "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/ohe.png\n",
    "---\n",
    "width: 65%\n",
    "name: directive-fig\n",
    "---\n",
    "One-hot encoding\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas has a built-in `.get_dummies` function for doing this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Complaint</th>\n",
       "      <th>New Construction</th>\n",
       "      <th>New Ownership</th>\n",
       "      <th>Reinspection/Followup</th>\n",
       "      <th>Routine - Unscheduled</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Complaint  New Construction  New Ownership  Reinspection/Followup  \\\n",
       "0          0                 0              0                      0   \n",
       "1          0                 0              0                      0   \n",
       "2          1                 0              0                      0   \n",
       "3          0                 0              0                      0   \n",
       "4          0                 0              0                      0   \n",
       "\n",
       "   Routine - Unscheduled  \n",
       "0                      1  \n",
       "1                      1  \n",
       "2                      0  \n",
       "3                      1  \n",
       "4                      1  "
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(data['type']).head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}