{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas III: Selection, Filtering and Dropping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section, we will learn how to extract and remove a subset of rows and columns in pandas. The two primary operations of data extraction are: \n",
"\n",
"1. **Selection**: Extracting subset of columns. \n",
"2. **Filtering**: Extracting subset of rows.\n",
"\n",
"Let's start by loading the dataset."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd \n",
"\n",
"url = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv\"\n",
"\n",
"elections = pd.read_csv(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Selection**: subset of columns\n",
"\n",
"To select a column in a `DataFrame`, we can use the bracket notation. That is, name of the DataFrame followed by the column name in square brackets: `df['column_name']`. \n",
"\n",
"```{figure} https://pandas.pydata.org/docs/_images/03_subset_columns.svg\n",
"---\n",
"width: 85%\n",
"name: subset_columns\n",
"---\n",
"To select a column, we can use the bracket notation: `df['column_name']`. This extracts a single column as a `Series`.\n",
"\n",
"To select multiple columns, we can pass a list of column names: `df[['column_name1', 'column_name2']]`. This extracts multiple columns as a `DataFrame`.\n",
"```\n",
"\n",
"For example, to select a column named `Candidate` from the `election` DataFrame, we can use the following code:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 Andrew Jackson\n",
"1 John Quincy Adams\n",
"2 Andrew Jackson\n",
"3 John Quincy Adams\n",
"4 Andrew Jackson\n",
" ... \n",
"177 Jill Stein\n",
"178 Joseph Biden\n",
"179 Donald Trump\n",
"180 Jo Jorgensen\n",
"181 Howard Hawkins\n",
"Name: Candidate, Length: 182, dtype: object\n"
]
}
],
"source": [
"candidates = elections['Candidate']\n",
"print(candidates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This extracts a single column as a `Series`. We can confirm this by checking the type of the output."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(candidates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To select multiple columns, we can pass a list of column names. For example, to select both `Candidate` and `Votes` columns from the `election` DataFrame, we can use the following line of code:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Candidate | \n",
" Party | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
"
\n",
" \n",
" 1 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
"
\n",
" \n",
" 2 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
"
\n",
" \n",
" 3 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
"
\n",
" \n",
" 4 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 177 | \n",
" Jill Stein | \n",
" Green | \n",
"
\n",
" \n",
" 178 | \n",
" Joseph Biden | \n",
" Democratic | \n",
"
\n",
" \n",
" 179 | \n",
" Donald Trump | \n",
" Republican | \n",
"
\n",
" \n",
" 180 | \n",
" Jo Jorgensen | \n",
" Libertarian | \n",
"
\n",
" \n",
" 181 | \n",
" Howard Hawkins | \n",
" Green | \n",
"
\n",
" \n",
"
\n",
"
182 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Candidate Party\n",
"0 Andrew Jackson Democratic-Republican\n",
"1 John Quincy Adams Democratic-Republican\n",
"2 Andrew Jackson Democratic\n",
"3 John Quincy Adams National Republican\n",
"4 Andrew Jackson Democratic\n",
".. ... ...\n",
"177 Jill Stein Green\n",
"178 Joseph Biden Democratic\n",
"179 Donald Trump Republican\n",
"180 Jo Jorgensen Libertarian\n",
"181 Howard Hawkins Green\n",
"\n",
"[182 rows x 2 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elections[['Candidate', 'Party']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This extracts multiple columns as a `DataFrame`. We can confirm as well this by checking the type of the output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"type(elections[['Candidate', 'Party']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is how we can select columns in a `DataFrame`. Next, let's learn how to filter rows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Filtering Condition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perhaps the most interesting (and useful) method of selecting data from a Series is with a filtering condition.\n",
"\n",
"First, we apply a boolean condition to the Series. This create **a new Series of boolean values**."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"a False\n",
"b False\n",
"c True\n",
"d True\n",
"dtype: bool"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})\n",
"series > 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``` {image} https://fahadsultan.com/csc272/_images/filter.png\n",
":width: 50%\n",
":align: center\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then use this boolean condition to index into our original Series. pandas will select only the entries in the original Series that satisfy the condition."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"c 3\n",
"d 4\n",
"dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"series[series > 2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Filtering**: subset of rows\n",
"\n",
"Extracting a subset of rows from a `DataFrame` is called filtering. \n",
"\n",
"We can filter rows based on a boolean condition, similar to conditional statements (e.g., `if`, `else`) in Python.\n",
"\n",
"```{figure} https://pandas.pydata.org/docs/_images/03_subset_rows.svg\n",
"---\n",
"width: 85%\n",
"name: subset_rows\n",
"align: center\n",
"---\n",
"To filter rows based on a boolean condition, we can use the bracket notation: `df[boolean_condition]`. This extracts rows where the condition is `True`.\n",
"\n",
"To filter rows based on multiple conditions, we can use the `&` operator for `AND` and the `|` operator for `OR`.\n",
"```\n",
"\n",
"For example, to filter rows of candidates who ran for elections since 2010, we can use the following code:\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 168 | \n",
" 2012 | \n",
" Barack Obama | \n",
" Democratic | \n",
" 65915795 | \n",
" win | \n",
" 51.258484 | \n",
"
\n",
" \n",
" 169 | \n",
" 2012 | \n",
" Gary Johnson | \n",
" Libertarian | \n",
" 1275971 | \n",
" loss | \n",
" 0.992241 | \n",
"
\n",
" \n",
" 170 | \n",
" 2012 | \n",
" Jill Stein | \n",
" Green | \n",
" 469627 | \n",
" loss | \n",
" 0.365199 | \n",
"
\n",
" \n",
" 171 | \n",
" 2012 | \n",
" Mitt Romney | \n",
" Republican | \n",
" 60933504 | \n",
" loss | \n",
" 47.384076 | \n",
"
\n",
" \n",
" 172 | \n",
" 2016 | \n",
" Darrell Castle | \n",
" Constitution | \n",
" 203091 | \n",
" loss | \n",
" 0.149640 | \n",
"
\n",
" \n",
" 173 | \n",
" 2016 | \n",
" Donald Trump | \n",
" Republican | \n",
" 62984828 | \n",
" win | \n",
" 46.407862 | \n",
"
\n",
" \n",
" 174 | \n",
" 2016 | \n",
" Evan McMullin | \n",
" Independent | \n",
" 732273 | \n",
" loss | \n",
" 0.539546 | \n",
"
\n",
" \n",
" 175 | \n",
" 2016 | \n",
" Gary Johnson | \n",
" Libertarian | \n",
" 4489235 | \n",
" loss | \n",
" 3.307714 | \n",
"
\n",
" \n",
" 176 | \n",
" 2016 | \n",
" Hillary Clinton | \n",
" Democratic | \n",
" 65853514 | \n",
" loss | \n",
" 48.521539 | \n",
"
\n",
" \n",
" 177 | \n",
" 2016 | \n",
" Jill Stein | \n",
" Green | \n",
" 1457226 | \n",
" loss | \n",
" 1.073699 | \n",
"
\n",
" \n",
" 178 | \n",
" 2020 | \n",
" Joseph Biden | \n",
" Democratic | \n",
" 81268924 | \n",
" win | \n",
" 51.311515 | \n",
"
\n",
" \n",
" 179 | \n",
" 2020 | \n",
" Donald Trump | \n",
" Republican | \n",
" 74216154 | \n",
" loss | \n",
" 46.858542 | \n",
"
\n",
" \n",
" 180 | \n",
" 2020 | \n",
" Jo Jorgensen | \n",
" Libertarian | \n",
" 1865724 | \n",
" loss | \n",
" 1.177979 | \n",
"
\n",
" \n",
" 181 | \n",
" 2020 | \n",
" Howard Hawkins | \n",
" Green | \n",
" 405035 | \n",
" loss | \n",
" 0.255731 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result %\n",
"168 2012 Barack Obama Democratic 65915795 win 51.258484\n",
"169 2012 Gary Johnson Libertarian 1275971 loss 0.992241\n",
"170 2012 Jill Stein Green 469627 loss 0.365199\n",
"171 2012 Mitt Romney Republican 60933504 loss 47.384076\n",
"172 2016 Darrell Castle Constitution 203091 loss 0.149640\n",
"173 2016 Donald Trump Republican 62984828 win 46.407862\n",
"174 2016 Evan McMullin Independent 732273 loss 0.539546\n",
"175 2016 Gary Johnson Libertarian 4489235 loss 3.307714\n",
"176 2016 Hillary Clinton Democratic 65853514 loss 48.521539\n",
"177 2016 Jill Stein Green 1457226 loss 1.073699\n",
"178 2020 Joseph Biden Democratic 81268924 win 51.311515\n",
"179 2020 Donald Trump Republican 74216154 loss 46.858542\n",
"180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979\n",
"181 2020 Howard Hawkins Green 405035 loss 0.255731"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"condition = election['Year'] > 2010\n",
"\n",
"election[condition]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To filter rows based on **multiple conditions**, we can use the `&` operator for `AND` and the `|` operator for `OR`. \n",
"\n",
"For example, to filter rows of candidates who won the elections with less than 50% of the votes, we can use the following code:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" 113142 | \n",
" win | \n",
" 42.789878 | \n",
"
\n",
" \n",
" 16 | \n",
" 1848 | \n",
" Zachary Taylor | \n",
" Whig | \n",
" 1360235 | \n",
" win | \n",
" 47.309296 | \n",
"
\n",
" \n",
" 20 | \n",
" 1856 | \n",
" James Buchanan | \n",
" Democratic | \n",
" 1835140 | \n",
" win | \n",
" 45.306080 | \n",
"
\n",
" \n",
" 23 | \n",
" 1860 | \n",
" Abraham Lincoln | \n",
" Republican | \n",
" 1855993 | \n",
" win | \n",
" 39.699408 | \n",
"
\n",
" \n",
" 33 | \n",
" 1876 | \n",
" Rutherford Hayes | \n",
" Republican | \n",
" 4034142 | \n",
" win | \n",
" 48.471624 | \n",
"
\n",
" \n",
" 36 | \n",
" 1880 | \n",
" James Garfield | \n",
" Republican | \n",
" 4453337 | \n",
" win | \n",
" 48.369234 | \n",
"
\n",
" \n",
" 39 | \n",
" 1884 | \n",
" Grover Cleveland | \n",
" Democratic | \n",
" 4914482 | \n",
" win | \n",
" 48.884933 | \n",
"
\n",
" \n",
" 43 | \n",
" 1888 | \n",
" Benjamin Harrison | \n",
" Republican | \n",
" 5443633 | \n",
" win | \n",
" 47.858041 | \n",
"
\n",
" \n",
" 47 | \n",
" 1892 | \n",
" Grover Cleveland | \n",
" Democratic | \n",
" 5553898 | \n",
" win | \n",
" 46.121393 | \n",
"
\n",
" \n",
" 70 | \n",
" 1912 | \n",
" Woodrow Wilson | \n",
" Democratic | \n",
" 6296284 | \n",
" win | \n",
" 41.933422 | \n",
"
\n",
" \n",
" 74 | \n",
" 1916 | \n",
" Woodrow Wilson | \n",
" Democratic | \n",
" 9126868 | \n",
" win | \n",
" 49.367987 | \n",
"
\n",
" \n",
" 100 | \n",
" 1948 | \n",
" Harry Truman | \n",
" Democratic | \n",
" 24179347 | \n",
" win | \n",
" 49.601536 | \n",
"
\n",
" \n",
" 117 | \n",
" 1968 | \n",
" Richard Nixon | \n",
" Republican | \n",
" 31783783 | \n",
" win | \n",
" 43.565246 | \n",
"
\n",
" \n",
" 140 | \n",
" 1992 | \n",
" Bill Clinton | \n",
" Democratic | \n",
" 44909806 | \n",
" win | \n",
" 43.118485 | \n",
"
\n",
" \n",
" 144 | \n",
" 1996 | \n",
" Bill Clinton | \n",
" Democratic | \n",
" 47400125 | \n",
" win | \n",
" 49.296938 | \n",
"
\n",
" \n",
" 152 | \n",
" 2000 | \n",
" George W. Bush | \n",
" Republican | \n",
" 50456002 | \n",
" win | \n",
" 47.974666 | \n",
"
\n",
" \n",
" 173 | \n",
" 2016 | \n",
" Donald Trump | \n",
" Republican | \n",
" 62984828 | \n",
" win | \n",
" 46.407862 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"1 1824 John Quincy Adams Democratic-Republican 113142 win \n",
"16 1848 Zachary Taylor Whig 1360235 win \n",
"20 1856 James Buchanan Democratic 1835140 win \n",
"23 1860 Abraham Lincoln Republican 1855993 win \n",
"33 1876 Rutherford Hayes Republican 4034142 win \n",
"36 1880 James Garfield Republican 4453337 win \n",
"39 1884 Grover Cleveland Democratic 4914482 win \n",
"43 1888 Benjamin Harrison Republican 5443633 win \n",
"47 1892 Grover Cleveland Democratic 5553898 win \n",
"70 1912 Woodrow Wilson Democratic 6296284 win \n",
"74 1916 Woodrow Wilson Democratic 9126868 win \n",
"100 1948 Harry Truman Democratic 24179347 win \n",
"117 1968 Richard Nixon Republican 31783783 win \n",
"140 1992 Bill Clinton Democratic 44909806 win \n",
"144 1996 Bill Clinton Democratic 47400125 win \n",
"152 2000 George W. Bush Republican 50456002 win \n",
"173 2016 Donald Trump Republican 62984828 win \n",
"\n",
" % \n",
"1 42.789878 \n",
"16 47.309296 \n",
"20 45.306080 \n",
"23 39.699408 \n",
"33 48.471624 \n",
"36 48.369234 \n",
"39 48.884933 \n",
"43 47.858041 \n",
"47 46.121393 \n",
"70 41.933422 \n",
"74 49.367987 \n",
"100 49.601536 \n",
"117 43.565246 \n",
"140 43.118485 \n",
"144 49.296938 \n",
"152 47.974666 \n",
"173 46.407862 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"condition = (election['Result'] == 'win') & (election['%'] < 50)\n",
"\n",
"election[condition]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting subset of values\n",
"\n",
"To extract a subset of values, we can use `.loc[]` or `.iloc[]` with row and column indices and labels respectively. \n",
"\n",
"```{figure} https://pandas.pydata.org/docs/_images/03_subset_columns_rows.svg\n",
"\n",
"---\n",
"width: 85%\n",
"name: subset_columns_rows\n",
"align: center\n",
"---\n",
"To extract a subset of values, we can use `.loc[]` or `.iloc[]` with row and column indices and labels respectively.\n",
"```\n",
"\n",
"The `.loc[]` method is used to access a group of rows and columns by labels or a boolean array. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### `.loc[row_labels, col_labels]`\n",
"\n",
"The `.loc` operator selects rows and columns in a DataFrame by their row and column label(s), respectively. The **row labels** (commonly referred to as the **indices**) are the bold text on the far _left_ of a DataFrame, while the **column labels** are the column names found at the _top_ of a DataFrame.\n",
"\n",
"\n",
"```{figure} ../assets/loc.png\n",
"---\n",
"width: 80%\n",
"align: center\n",
"---\n",
"`.loc[i, j]` returns value(s) where row **label**(s)`== i` and column **label**(s)`== j` in the DataFrame.\n",
"``` \n",
"\n",
"To grab data with `.loc`, we must specify the row and column label(s) where the data exists. The row labels are the first argument to the .loc function; the column labels are the second. For example, we can select the the row labeled `0` and the column labeled `Candidate` from the `elections` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elections.loc[0, 'Candidate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"To select _multiple_ rows and columns, we can use Python slice notation. Here, we select the rows from labels `0` to `3` and the columns from labels `\"Year\"` to `\"Popular vote\"`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elections.loc[0:3, 'Year':'Popular vote']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose that instead, we wanted _every_ column value for the first four rows in the `elections` DataFrame. The shorthand `:` is useful for this.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elections.loc[0:3, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a couple of things we should note. Firstly, unlike conventional Python, Pandas allows us to slice string values (in our example, the column labels). Secondly, slicing with `.loc` is _inclusive_. Notice how our resulting DataFrame includes every row and column between and including the slice labels we specified.\n",
"\n",
"Equivalently, we can use a list to obtain multiple rows and columns in our `elections` DataFrame.\n",
"elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]\n",
"\n",
"Lastly, we can interchange list and slicing notation.\n",
"elections.loc[[0, 1, 2, 3], :]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"### `.iloc[row_indices, col_indices]`\n",
"\n",
"The `.iloc[]` method is used to access a group of rows and columns by integer position.\n",
"\n",
"```{warning}\n",
"If you find yourself needing to use `.iloc` then stop and think if you are about to implement a loop. If so, there is probably a better way to do it.\n",
"```\n",
"\n",
"Slicing with `.iloc` works similarily to `.loc`, however, `.iloc` uses the index positions of rows and columns rather the labels (think to yourself: **`l`**`oc` uses **l**abels; **`i`**`loc` uses **i**ndices). The arguments to the `.iloc` function also behave similarly -– single values, lists, indices, and any combination of these are permitted.\n",
"\n",
"\n",
"```{figure} ../assets/iloc.png\n",
"---\n",
"width: 80%\n",
"align: center\n",
"---\n",
"`.iloc[i, j]` returns value(s) where row **location**(s)`== i` and column **location**(s)`== j` in the DataFrame.\n",
"``` \n",
"\n",
"Let’s begin reproducing our results from above. We’ll begin by selecting for the first presidential candidate in our `elections` DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# elections.loc[0, \"Candidate\"] - Previous approach\n",
"elections.iloc[0, 1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Notice how the first argument to both `.loc` and `.iloc` are the same. This is because the row with a label of 0 is conveniently in the 0th index (equivalently, the first position) of the `elections` DataFrame. Generally, this is true of any DataFrame where the row labels are incremented in ascending order from 0.\n",
"\n",
"However, when we select the first four rows and columns using `.iloc`, we notice something."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# elections.loc[0:3, 'Year':'Popular vote'] - Previous approach\n",
"elections.iloc[0:4, 0:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Slicing is no longer inclusive in `.iloc` -– it’s exclusive. In other words, the right-end of a slice is not included when using `.iloc`. This is one of the subtleties of `pandas` syntax; you will get used to it with practice.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach\n",
"elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This discussion begs the question: when should we use `.loc` vs `.iloc`? In most cases, `.loc` is generally safer to use. You can imagine `.iloc` may return incorrect values when applied to a dataset where the ordering of data can change.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### `[]`\n",
"\n",
"The `[]` selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:\n",
"\n",
"1. A slice of row numbers\n",
"2. A list of column labels\n",
"3. A single column label\n",
"\n",
"That is, `[]` is _context dependent_. Let’s see some examples.\n",
"\n",
"Say we wanted the first four rows of our `elections` DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" 113142 | \n",
" win | \n",
" 42.789878 | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" 500897 | \n",
" loss | \n",
" 43.796073 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"0 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"1 1824 John Quincy Adams Democratic-Republican 113142 win \n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 John Quincy Adams National Republican 500897 loss \n",
"\n",
" % \n",
"0 57.210122 \n",
"1 42.789878 \n",
"2 56.203927 \n",
"3 43.796073 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elections[0:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dropping rows and columns\n",
"\n",
"To drop rows and columns in a `DataFrame`, we can use the `drop()` method. \n",
"\n",
"For example, to drop the first row from the `election` DataFrame, we can use the following code:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" 113142 | \n",
" win | \n",
" 42.789878 | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" 500897 | \n",
" loss | \n",
" 43.796073 | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 702735 | \n",
" win | \n",
" 54.574789 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"0 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"1 1824 John Quincy Adams Democratic-Republican 113142 win \n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 John Quincy Adams National Republican 500897 loss \n",
"4 1832 Andrew Jackson Democratic 702735 win \n",
"\n",
" % \n",
"0 57.210122 \n",
"1 42.789878 \n",
"2 56.203927 \n",
"3 43.796073 \n",
"4 54.574789 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elections.head()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" loss | \n",
" 57.210122 | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" win | \n",
" 42.789878 | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" win | \n",
" 56.203927 | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" loss | \n",
" 43.796073 | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" win | \n",
" 54.574789 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 177 | \n",
" 2016 | \n",
" Jill Stein | \n",
" Green | \n",
" loss | \n",
" 1.073699 | \n",
"
\n",
" \n",
" 178 | \n",
" 2020 | \n",
" Joseph Biden | \n",
" Democratic | \n",
" win | \n",
" 51.311515 | \n",
"
\n",
" \n",
" 179 | \n",
" 2020 | \n",
" Donald Trump | \n",
" Republican | \n",
" loss | \n",
" 46.858542 | \n",
"
\n",
" \n",
" 180 | \n",
" 2020 | \n",
" Jo Jorgensen | \n",
" Libertarian | \n",
" loss | \n",
" 1.177979 | \n",
"
\n",
" \n",
" 181 | \n",
" 2020 | \n",
" Howard Hawkins | \n",
" Green | \n",
" loss | \n",
" 0.255731 | \n",
"
\n",
" \n",
"
\n",
"
182 rows × 5 columns
\n",
"
"
],
"text/plain": [
" Year Candidate Party Result %\n",
"0 1824 Andrew Jackson Democratic-Republican loss 57.210122\n",
"1 1824 John Quincy Adams Democratic-Republican win 42.789878\n",
"2 1828 Andrew Jackson Democratic win 56.203927\n",
"3 1828 John Quincy Adams National Republican loss 43.796073\n",
"4 1832 Andrew Jackson Democratic win 54.574789\n",
".. ... ... ... ... ...\n",
"177 2016 Jill Stein Green loss 1.073699\n",
"178 2020 Joseph Biden Democratic win 51.311515\n",
"179 2020 Donald Trump Republican loss 46.858542\n",
"180 2020 Jo Jorgensen Libertarian loss 1.177979\n",
"181 2020 Howard Hawkins Green loss 0.255731\n",
"\n",
"[182 rows x 5 columns]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"elections.drop(columns=['Popular vote'])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" 500897 | \n",
" loss | \n",
" 43.796073 | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 702735 | \n",
" win | \n",
" 54.574789 | \n",
"
\n",
" \n",
" 5 | \n",
" 1832 | \n",
" Henry Clay | \n",
" National Republican | \n",
" 484205 | \n",
" loss | \n",
" 37.603628 | \n",
"
\n",
" \n",
" 6 | \n",
" 1832 | \n",
" William Wirt | \n",
" Anti-Masonic | \n",
" 100715 | \n",
" loss | \n",
" 7.821583 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 177 | \n",
" 2016 | \n",
" Jill Stein | \n",
" Green | \n",
" 1457226 | \n",
" loss | \n",
" 1.073699 | \n",
"
\n",
" \n",
" 178 | \n",
" 2020 | \n",
" Joseph Biden | \n",
" Democratic | \n",
" 81268924 | \n",
" win | \n",
" 51.311515 | \n",
"
\n",
" \n",
" 179 | \n",
" 2020 | \n",
" Donald Trump | \n",
" Republican | \n",
" 74216154 | \n",
" loss | \n",
" 46.858542 | \n",
"
\n",
" \n",
" 180 | \n",
" 2020 | \n",
" Jo Jorgensen | \n",
" Libertarian | \n",
" 1865724 | \n",
" loss | \n",
" 1.177979 | \n",
"
\n",
" \n",
" 181 | \n",
" 2020 | \n",
" Howard Hawkins | \n",
" Green | \n",
" 405035 | \n",
" loss | \n",
" 0.255731 | \n",
"
\n",
" \n",
"
\n",
"
180 rows × 6 columns
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 John Quincy Adams National Republican 500897 loss \n",
"4 1832 Andrew Jackson Democratic 702735 win \n",
"5 1832 Henry Clay National Republican 484205 loss \n",
"6 1832 William Wirt Anti-Masonic 100715 loss \n",
".. ... ... ... ... ... \n",
"177 2016 Jill Stein Green 1457226 loss \n",
"178 2020 Joseph Biden Democratic 81268924 win \n",
"179 2020 Donald Trump Republican 74216154 loss \n",
"180 2020 Jo Jorgensen Libertarian 1865724 loss \n",
"181 2020 Howard Hawkins Green 405035 loss \n",
"\n",
" % \n",
"2 56.203927 \n",
"3 43.796073 \n",
"4 54.574789 \n",
"5 37.603628 \n",
"6 7.821583 \n",
".. ... \n",
"177 1.073699 \n",
"178 51.311515 \n",
"179 46.858542 \n",
"180 1.177979 \n",
"181 0.255731 \n",
"\n",
"[180 rows x 6 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop the first row\n",
"elections.drop(index=0)\n",
"\n",
"# Drop the first two rows\n",
"elections.drop(index=[0, 1])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}