{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tabular data (`pd.DataFrame`), as discussed previously, is made up of observations (rows) and features (columns). The data types (`df.dtypes`) of features fall into two primary categories: **numeric** and **categorical**.\n", "\n", "There is also a third, special category called **missing**. Missing data is special because it is not really a data type at all: it is a placeholder for a value that is unknown or not applicable. pandas represents missing data as `NaN` (Not a Number). More on missing data below.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/featuretypes.png\n", "---\n", "width: 100%\n", "name: directive-fig\n", "---\n", "Classification of feature types\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To study these feature types, we will use the **dataset of food safety scores** for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health. " ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idzipphonelatlngtypescoreriskviolation
07006494103.01.415565e+10NaNNaNRoutine - Unscheduled75.0High RiskImproper reheating of food
19003994103.0NaNNaNNaNRoutine - Unscheduled81.0High RiskHigh risk food holding temperature
28905994115.01.415369e+10NaNNaNComplaintNaNNaNNaN
39104494112.0NaNNaNNaNRoutine - Unscheduled84.0Moderate RiskInadequate and inaccessible handwashing facili...
46276894122.0NaN37.765421-122.477256Routine - Unscheduled90.0Low RiskFood safety certificate or food handler card n...
\n", "
" ], "text/plain": [ " id zip phone lat lng type \\\n", "0 70064 94103.0 1.415565e+10 NaN NaN Routine - Unscheduled \n", "1 90039 94103.0 NaN NaN NaN Routine - Unscheduled \n", "2 89059 94115.0 1.415369e+10 NaN NaN Complaint \n", "3 91044 94112.0 NaN NaN NaN Routine - Unscheduled \n", "4 62768 94122.0 NaN 37.765421 -122.477256 Routine - Unscheduled \n", "\n", " score risk violation \n", "0 75.0 High Risk Improper reheating of food \n", "1 81.0 High Risk High risk food holding temperature \n", "2 NaN NaN NaN \n", "3 84.0 Moderate Risk Inadequate and inaccessible handwashing facili... \n", "4 90.0 Low Risk Food safety certificate or food handler card n... " ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants_truncated.csv', index_col=0)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Missing Data\n", "\n", "Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on pandas objects exclude missing data by default.\n", "\n", "The way that missing data is represented in pandas objects is somewhat imperfect, but it is sufficient for most real-world use. 
For data with `float64` dtype, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data.\n", "\n", "We call this a _sentinel value_: when present, it indicates a missing (or null) value.\n", "\n", "The `isna` method returns Boolean values that are `True` where values are null. Summing them gives the number of missing values in each column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "zip 1\n", "phone 27\n", "lat 30\n", "lng 30\n", "type 0\n", "score 15\n", "risk 17\n", "violation 17\n", "dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In pandas, missing data is also referred to as NA, which stands for _Not Available_. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by the missing data.\n", "\n", "The built-in Python `None` value is also treated as NA.\n", "\n", "| Method | Description |\n", "| :------------ | -------------: |\n", "| `dropna` | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |\n", "| `fillna` | Fill in missing data with some value. |\n", "| `isna` | Return Boolean values indicating which values are missing/NA. |\n", "| `notna` | Negation of `isna`: returns `True` for non-NA values and `False` for NA values. |\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical Features\n", "\n", "Numeric data is data that can be represented as numbers. These variables generally describe some numeric _quantity_ or _amount_ and are also sometimes referred to as \"quantitative\" variables. 
\n", "\n", "Since numerical features are already represented as numbers, they can be used in machine learning models directly, with no need to encode them.\n", "\n", "In the example above, the numerical features are `zip`, `phone`, `lat`, `lng`, and `score`. Note that `zip` and `phone` are stored as numbers but act as identifiers rather than quantities, so arithmetic on them is not meaningful. " ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zipphonelatlngscore
094105NaN37.787925-122.40095382.0
194109NaN37.786108-122.425764NaN
294115NaN37.791607-122.43456382.0
394115NaN37.788932-122.43389578.0
494110NaN37.739161-122.41696794.0
\n", "
" ], "text/plain": [ " zip phone lat lng score\n", "0 94105 NaN 37.787925 -122.400953 82.0\n", "1 94109 NaN 37.786108 -122.425764 NaN\n", "2 94115 NaN 37.791607 -122.434563 82.0\n", "3 94115 NaN 37.788932 -122.433895 78.0\n", "4 94110 NaN 37.739161 -122.416967 94.0" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['zip', 'phone', 'lat', 'lng', 'score']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discrete Features\n", "\n", "Discrete data is data that is counted. For example, the number of students in a class is discrete: you can count the students, but the count is always a whole number, never a fraction of a student.\n", "\n", "In the restaurant inspection dataset, `zip`, `phone`, and `score` are discrete features." ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zipphonescore
094105NaN82.0
194109NaNNaN
294115NaN82.0
394115NaN78.0
494110NaN94.0
\n", "
" ], "text/plain": [ " zip phone score\n", "0 94105 NaN 82.0\n", "1 94109 NaN NaN\n", "2 94115 NaN 82.0\n", "3 94115 NaN 78.0\n", "4 94110 NaN 94.0" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['zip', 'phone', 'score']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Continuous Features\n", "\n", "Continuous data is data that is measured. For example, the height of a student is continuous: a measurement can take any value within a range, including fractional values, such as 5 feet 6.5 inches.\n", "\n", "In the restaurant inspection dataset, `lat` and `lng` are continuous features." ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlng
037.787925-122.400953
137.786108-122.425764
237.791607-122.434563
337.788932-122.433895
437.739161-122.416967
\n", "
" ], "text/plain": [ " lat lng\n", "0 37.787925 -122.400953\n", "1 37.786108 -122.425764\n", "2 37.791607 -122.434563\n", "3 37.788932 -122.433895\n", "4 37.739161 -122.416967" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['lat', 'lng']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Features\n", "\n", "Categorical data is data that is not numeric. It is often represented as text or a set of text values. These variables generally describe some _characteristic_ or _quality_ of a data unit, and are also sometimes referred to as \"qualitative\" variables." ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeriskviolation
0Routine - UnscheduledHigh RiskHigh risk food holding temperature
1ComplaintNaNNaN
2Routine - UnscheduledLow RiskInadequate warewashing facilities or equipment
3Routine - UnscheduledLow RiskImproper food storage
4Routine - UnscheduledLow RiskUnapproved or unmaintained equipment or utensils
\n", "
" ], "text/plain": [ " type risk \\\n", "0 Routine - Unscheduled High Risk \n", "1 Complaint NaN \n", "2 Routine - Unscheduled Low Risk \n", "3 Routine - Unscheduled Low Risk \n", "4 Routine - Unscheduled Low Risk \n", "\n", " violation \n", "0 High risk food holding temperature \n", "1 NaN \n", "2 Inadequate warewashing facilities or equipment \n", "3 Improper food storage \n", "4 Unapproved or unmaintained equipment or utensils " ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['type', 'risk', 'violation']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ordinal Features\n", "\n", "Ordinal data is data that is ordered in some way. For example, the size of a t-shirt is ordinal data: the sizes run from smallest to largest, so any two sizes can be compared.\n", "\n", "In the restaurant inspection dataset, `risk` is an ordinal feature.\n" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
risk
0High Risk
1NaN
2Low Risk
3Low Risk
4Low Risk
\n", "
" ], "text/plain": [ " risk\n", "0 High Risk\n", "1 NaN\n", "2 Low Risk\n", "3 Low Risk\n", "4 Low Risk" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['risk']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Encoding Ordinal Features\n", "\n", "Ordinal features can be encoded using a technique called **label encoding**. Label encoding converts each value in a column to a number, chosen so that the numbers follow the order of the categories. For example, the sizes of t-shirts could be represented as 0 (XS), 1 (S), 2 (M), 3 (L), 4 (XL), 5 (XXL)." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idzipphonelatlngtypescoreriskviolationrisk_enc
06445494105NaN37.787925-122.400953Routine - Unscheduled82.0High RiskHigh risk food holding temperature2.0
13301494109NaN37.786108-122.425764ComplaintNaNNaNNaNNaN
2152694115NaN37.791607-122.434563Routine - Unscheduled82.0Low RiskInadequate warewashing facilities or equipment0.0
37394115NaN37.788932-122.433895Routine - Unscheduled78.0Low RiskImproper food storage0.0
46640294110NaN37.739161-122.416967Routine - Unscheduled94.0Low RiskUnapproved or unmaintained equipment or utensils0.0
\n", "
" ], "text/plain": [ " id zip phone lat lng type score \\\n", "0 64454 94105 NaN 37.787925 -122.400953 Routine - Unscheduled 82.0 \n", "1 33014 94109 NaN 37.786108 -122.425764 Complaint NaN \n", "2 1526 94115 NaN 37.791607 -122.434563 Routine - Unscheduled 82.0 \n", "3 73 94115 NaN 37.788932 -122.433895 Routine - Unscheduled 78.0 \n", "4 66402 94110 NaN 37.739161 -122.416967 Routine - Unscheduled 94.0 \n", "\n", " risk violation risk_enc \n", "0 High Risk High risk food holding temperature 2.0 \n", "1 NaN NaN NaN \n", "2 Low Risk Inadequate warewashing facilities or equipment 0.0 \n", "3 Low Risk Improper food storage 0.0 \n", "4 Low Risk Unapproved or unmaintained equipment or utensils 0.0 " ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['risk_enc'] = data['risk'].replace({'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2})\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Nominal Features\n", "\n", "Nominal data is data that is not ordered in any way. For example, the color of a car is nominal data. There is no order to the colors. The colors are not more or less than each other. They are just different." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeviolation
0Routine - UnscheduledHigh risk food holding temperature
1ComplaintNaN
2Routine - UnscheduledInadequate warewashing facilities or equipment
3Routine - UnscheduledImproper food storage
4Routine - UnscheduledUnapproved or unmaintained equipment or utensils
\n", "
" ], "text/plain": [ " type violation\n", "0 Routine - Unscheduled High risk food holding temperature\n", "1 Complaint NaN\n", "2 Routine - Unscheduled Inadequate warewashing facilities or equipment\n", "3 Routine - Unscheduled Improper food storage\n", "4 Routine - Unscheduled Unapproved or unmaintained equipment or utensils" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data[['type', 'violation']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Encoding Nominal Features\n", "\n", "Since nominal features don't have any order, assigning them ordered numbers would imply a ranking that does not exist. The most common way to encode nominal features is a technique called **one-hot encoding**. One-hot encoding creates a new column for each unique value in the nominal feature. Each new column is a binary feature that indicates whether or not the original observation had that value. \n", "\n", "The figure below illustrates one-hot encoding for a \"day\" (of the week) column: \n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/ohe.png\n", "---\n", "width: 65%\n", "name: directive-fig\n", "---\n", "One-hot encoding\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas has a built-in `pd.get_dummies` function for doing this:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ComplaintNew ConstructionNew OwnershipReinspection/FollowupRoutine - Unscheduled
000001
100001
210000
300001
400001
\n", "
" ], "text/plain": [ " Complaint New Construction New Ownership Reinspection/Followup \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 1 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "\n", " Routine - Unscheduled \n", "0 1 \n", "1 1 \n", "2 0 \n", "3 1 \n", "4 1 " ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(data['type']).head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }