{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas IV: Aggregation\n", "\n", "In this notebook, we will learn how to aggregate data using pandas. This generally entails grouping data by a certain column's values and then applying a function to the groups." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation with `.groupby`\n", "\n", "Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects.\n", "\n", "``` {figure} https://pandas.pydata.org/docs/_images/06_groupby.svg\n", "---\n", "name: groupby\n", "align: center\n", "width: 95%\n", "---\n", "GroupBy operation broken down into split-apply-combine steps.\n", "```\n", "\n", "``` {figure} https://pandas.pydata.org/docs/_images/06_valuecounts.svg\n", "---\n", "name: valuecounts\n", "align: center\n", "width: 95%\n", "---\n", "Value counts for a single column.\n", "```\n", "\n", "``` {figure} https://i.ibb.co/y0P3GXq/groupby1.png\n", "---\n", "name: groupby_select_detail\n", "align: center\n", "width: 95%\n", "---\n", "Detail of the split-apply-combine operation.\n", "```\n", "\n", "\n", "A groupby operation involves some combination of **splitting a DataFrame into grouped subframes**, **applying a function**, and **combining the results**.\n", "\n", "For some arbitrary DataFrame `df` below, the code `df.groupby(\"year\").sum()` does the following:\n", "\n", "- **Splits** the DataFrame into sub-DataFrames with rows belonging to the same year.\n", "- **Applies** the `sum` function to each column of each sub-DataFrame.\n", "- **Combines** the results of `sum` into a single DataFrame, indexed by `year`.\n", "\n", "
groupby_demo
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's say we had baby names for all years in a single DataFrame `names`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYear
0MaryF70651880
1AnnaF26041880
2EmmaF20031880
3ElizabethF19391880
4MinnieF17461880
...............
31677ZyellM52023
31678ZyenM52023
31679ZymirrM52023
31680ZyquanM52023
31681ZyrinM52023
\n", "

2117219 rows × 4 columns

\n", "
" ], "text/plain": [ " Name Sex Count Year\n", "0 Mary F 7065 1880\n", "1 Anna F 2604 1880\n", "2 Emma F 2003 1880\n", "3 Elizabeth F 1939 1880\n", "4 Minnie F 1746 1880\n", "... ... .. ... ...\n", "31677 Zyell M 5 2023\n", "31678 Zyen M 5 2023\n", "31679 Zymirr M 5 2023\n", "31680 Zyquan M 5 2023\n", "31681 Zyrin M 5 2023\n", "\n", "[2117219 rows x 4 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import urllib.request\n", "import os.path\n", "import pandas as pd \n", "\n", "# Download data from the web directly\n", "data_url = \"https://www.ssa.gov/oact/babynames/names.zip\"\n", "local_filename = \"../data/names.zip\"\n", "if not os.path.exists(local_filename): # if the data exists don't download again\n", " with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:\n", " f.write(resp.read())\n", "\n", " \n", "# Load data without unzipping the file\n", "import zipfile\n", "names = [] \n", "with zipfile.ZipFile(local_filename, \"r\") as zf:\n", " data_files = [f for f in zf.filelist if f.filename[-3:] == \"txt\"]\n", " def extract_year_from_filename(fn):\n", " return int(fn[3:7])\n", " for f in data_files:\n", " year = extract_year_from_filename(f.filename)\n", " with zf.open(f) as fp:\n", " df = pd.read_csv(fp, names=[\"Name\", \"Sex\", \"Count\"])\n", " df[\"Year\"] = year\n", " names.append(df)\n", "names = pd.concat(names)\n", "\n", "names" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "names.to_csv(\"../data/names.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we wanted to aggregate all rows in `names` for a given year, we would need `names.groupby(\"Year\")`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.groupby(\"Year\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "What does this strange output mean? Calling [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) has generated a `GroupBy` object. You can imagine this as a set of \"mini\" sub-DataFrames, where each subframe contains all of the rows from `names` that correspond to a particular year. \n", "\n", "The diagram below shows a simplified view of `names` to help illustrate this idea.\n", "\n", "![Creating a GroupBy object](../assets/gb.png)\n", "\n", "We can't work with a `GroupBy` object directly – that is why you saw that strange output earlier, rather than a standard view of a DataFrame. To actually manipulate values within these \"mini\" DataFrames, we'll need to call an *aggregation method*. This is a method that tells `pandas` how to aggregate the values within the `GroupBy` object. Once the aggregation is applied, `pandas` will return a normal (now grouped) DataFrame.\n", "\n", "Aggregation functions (`.min()`, `.max()`, `.mean()`, `.sum()`, etc.) are the most common way to work with `GroupBy` objects. These functions are applied to each column of a \"mini\" grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. Let's see this in action by finding the `sum` of all counts for each year in `names` – this is equivalent to finding the number of babies born in each year. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
1880201484
1881192690
1882221533
1883216944
1884243461
\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 201484\n", "1881 192690\n", "1882 221533\n", "1883 216944\n", "1884 243461" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.groupby(\"Year\").sum().head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of `names`, which is why we see smaller values for the summed counts.\n", "\n", "![Performing an aggregation](https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/agg.png)\n", "\n", "Calling `.agg` has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by `\"Year\"`, with a single row for each unique year in the original `names` DataFrame.\n", "\n", "You may be wondering: where did the `\"State\"`, `\"Sex\"`, and `\"Name\"` columns go? Logically, it doesn't make sense to `sum` the string data in these columns (how would we add \"Mary\" + \"Ann\"?). Because of this, `pandas` will simply omit these columns when it performs the aggregation on the DataFrame. Since this happens implicitly, without the user specifying that these columns should be ignored, it's easy to run into troubling situations where columns are removed without the programmer noticing. It is better coding practice to select *only* the columns we care about before performing the aggregation.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
1880201484
1881192690
1882221533
1883216944
1884243461
\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 201484\n", "1881 192690\n", "1882 221533\n", "1883 216944\n", "1884 243461" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same result, but now we explicitly tell pandas to only consider the \"Count\" column when summing\n", "names.groupby(\"Year\")[[\"Count\"]].sum().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many different aggregations that can be applied to the grouped data. The primary requirement is that an aggregation function must:\n", "\n", "* Take in a `Series` of data (a single column of the grouped subframe)\n", "* Return a single value that aggregates this `Series`\n", "\n", "Because of this fairly broad requirement, `pandas` offers many ways of computing an aggregation.\n", "\n", "**In-built** Python operations – such as `sum`, `max`, and `min` – are automatically recognized by `pandas`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban16
Aabha9
Aabid6
Aabidah5
Aabir5
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 16\n", "Aabha 9\n", "Aabid 6\n", "Aabidah 5\n", "Aabir 5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the maximum count for each name in any year?\n", "names.groupby(\"Name\")[[\"Count\"]].max().head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban5
Aabha5
Aabid5
Aabidah5
Aabir5
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 5\n", "Aabha 5\n", "Aabid 5\n", "Aabidah 5\n", "Aabir 5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the minimum count for each name in any year?\n", "names.groupby(\"Name\")[[\"Count\"]].min().head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban10.000000
Aabha6.375000
Aabid5.333333
Aabidah5.000000
Aabir5.000000
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 10.000000\n", "Aabha 6.375000\n", "Aabid 5.333333\n", "Aabidah 5.000000\n", "Aabir 5.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the average count for each name across all years?\n", "names.groupby(\"Name\")[[\"Count\"]].mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` also offers a number of in-built functions for aggregation. Some examples include:\n", "\n", "* `.sum()`\n", "* `.max()`\n", "* `.min()`\n", "* `.mean()`\n", "* `.first()`\n", "* `.last()`\n", "\n", "The latter two entries in this list – `\"first\"` and `\"last\"` – are unique to `pandas`. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where *multiple* columns in a group share identical information. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.\n", "\n", "Let's illustrate this with an example. Say we add a new column to `names` that contains the first letter of each name. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameFirst LetterYear
0MaryM1880
1AnnaA1880
2EmmaE1880
3ElizabethE1880
4MinnieM1880
\n", "
" ], "text/plain": [ " Name First Letter Year\n", "0 Mary M 1880\n", "1 Anna A 1880\n", "2 Emma E 1880\n", "3 Elizabeth E 1880\n", "4 Minnie M 1880" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Imagine we had an additional column, \"First Letter\". We'll explain this code next week\n", "names[\"First Letter\"] = names[\"Name\"].apply(lambda x: x[0])\n", "\n", "# We construct a simplified DataFrame containing just a subset of columns\n", "names_new = names[[\"Name\", \"First Letter\", \"Year\"]]\n", "names_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we form groups for each name in the dataset, `\"First Letter\"` will be the same for all members of the group. This means that if we simply select the first entry for `\"First Letter\"` in the group, we'll represent all data in that group. \n", "\n", "We can use a dictionary to apply different aggregation functions to each column during grouping.\n", "\n", "![Aggregating using \"first\"](../assets/first.png)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
First LetterYear
Name
AabanA2019
AabhaA2021
AabidA2018
AabidahA2018
AabirA2018
\n", "
" ], "text/plain": [ " First Letter Year\n", "Name \n", "Aaban A 2019\n", "Aabha A 2021\n", "Aabid A 2018\n", "Aabidah A 2018\n", "Aabir A 2018" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names_new.groupby(\"Name\").agg({\"First Letter\":\"first\", \"Year\":\"max\"}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also define aggregation functions of our own! This can be done using either a `def` or `lambda` statement. Again, the condition for a custom aggregation function is that it must take in a `Series` and output a single scalar value." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'names' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [2]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mratio_to_peak\u001b[39m(series):\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m series\u001b[38;5;241m.\u001b[39miloc[\u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m]\u001b[38;5;241m/\u001b[39m\u001b[38;5;28mmax\u001b[39m(series)\n\u001b[0;32m----> 4\u001b[0m \u001b[43mnames\u001b[49m\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mName\u001b[39m\u001b[38;5;124m\"\u001b[39m)[[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mYear\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCount\u001b[39m\u001b[38;5;124m\"\u001b[39m]]\u001b[38;5;241m.\u001b[39mapply(ratio_to_peak)\n", "\u001b[0;31mNameError\u001b[0m: name 'names' is not defined" ] } ], "source": [ "def ratio_to_peak(series):\n", " return series.iloc[-1]/max(series)\n", "\n", "names.groupby(\"Name\")[[\"Year\", 
\"Count\"]].apply(ratio_to_peak)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "`lambda` functions are a special type of function that can be defined in a single line. They are also often refered to as _\"anonymous\"_ functions because these functions don't have a name. They are useful for simple functions that are not used elsewhere in your code.\n", "```" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCount
Name
Aaban1.00.375000
Aabha1.00.555556
Aabid1.01.000000
Aabidah1.01.000000
Aabir1.01.000000
.........
Zyvion1.01.000000
Zyvon1.01.000000
Zyyanna1.01.000000
Zyyon1.01.000000
Zzyzx1.01.000000
\n", "

101338 rows × 2 columns

\n", "
" ], "text/plain": [ " Year Count\n", "Name \n", "Aaban 1.0 0.375000\n", "Aabha 1.0 0.555556\n", "Aabid 1.0 1.000000\n", "Aabidah 1.0 1.000000\n", "Aabir 1.0 1.000000\n", "... ... ...\n", "Zyvion 1.0 1.000000\n", "Zyvon 1.0 1.000000\n", "Zyyanna 1.0 1.000000\n", "Zyyon 1.0 1.000000\n", "Zzyzx 1.0 1.000000\n", "\n", "[101338 rows x 2 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternatively, using lambda\n", "names.groupby(\"Name\")[[\"Year\", \"Count\"]].agg(lambda s: s.iloc[-1]/max(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation with `lambda` Functions\n", "\n", "We'll work with the `elections` DataFrame again." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
01824Andrew JacksonDemocratic-Republican151271loss57.210122
11824John Quincy AdamsDemocratic-Republican113142win42.789878
21828Andrew JacksonDemocratic642806win56.203927
31828John Quincy AdamsNational Republican500897loss43.796073
41832Andrew JacksonDemocratic702735win54.574789
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "0 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "1 1824 John Quincy Adams Democratic-Republican 113142 win \n", "2 1828 Andrew Jackson Democratic 642806 win \n", "3 1828 John Quincy Adams National Republican 500897 loss \n", "4 1832 Andrew Jackson Democratic 702735 win \n", "\n", " % \n", "0 57.210122 \n", "1 42.789878 \n", "2 56.203927 \n", "3 43.796073 \n", "4 54.574789 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "url = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv\"\n", "elections = pd.read_csv(url)\n", "elections.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wish to aggregate our DataFrame using a non-standard function – for example, a function of our own design? We can do so by combining `.agg` with `lambda` expressions.\n", "\n", "Let's first consider a puzzle to jog our memory. We will attempt to find the `Candidate` from each `Party` with the highest `%` of votes. \n", "\n", "A naive approach may be to group by the `Party` column and aggregate by the maximum." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePopular voteResult%
Party
American1976Thomas J. Anderson873053loss21.554001
American Independent1976Lester Maddox9901118loss13.571218
Anti-Masonic1832William Wirt100715loss7.821583
Anti-Monopoly1884Benjamin Butler134294loss1.335838
Citizens1980Barry Commoner233052loss0.270182
Communist1932William Z. Foster103307loss0.261069
Constitution2016Michael Peroutka203091loss0.152398
Constitutional Union1860John Bell590901loss12.639283
Democratic2020Woodrow Wilson81268924win61.344703
Democratic-Republican1824John Quincy Adams151271win57.210122
\n", "
" ], "text/plain": [ " Year Candidate Popular vote Result \\\n", "Party \n", "American 1976 Thomas J. Anderson 873053 loss \n", "American Independent 1976 Lester Maddox 9901118 loss \n", "Anti-Masonic 1832 William Wirt 100715 loss \n", "Anti-Monopoly 1884 Benjamin Butler 134294 loss \n", "Citizens 1980 Barry Commoner 233052 loss \n", "Communist 1932 William Z. Foster 103307 loss \n", "Constitution 2016 Michael Peroutka 203091 loss \n", "Constitutional Union 1860 John Bell 590901 loss \n", "Democratic 2020 Woodrow Wilson 81268924 win \n", "Democratic-Republican 1824 John Quincy Adams 151271 win \n", "\n", " % \n", "Party \n", "American 21.554001 \n", "American Independent 13.571218 \n", "Anti-Masonic 7.821583 \n", "Anti-Monopoly 1.335838 \n", "Citizens 0.270182 \n", "Communist 0.261069 \n", "Constitution 0.152398 \n", "Constitutional Union 12.639283 \n", "Democratic 61.344703 \n", "Democratic-Republican 57.210122 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.groupby(\"Party\").agg(max).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach is clearly wrong – the DataFrame claims that Woodrow Wilson won the presidency in 2020.\n", "\n", "Why is this happening? Here, the `max` aggregation function is taken over every column *independently*. Among Democrats, `max` is computing:\n", "\n", "- The most recent `Year` a Democratic candidate ran for president (2020)\n", "- The `Candidate` with the alphabetically \"largest\" name (\"Woodrow Wilson\")\n", "- The `Result` with the alphabetically \"largest\" outcome (\"win\")\n", "\n", "Instead, let's try a different approach. We will:\n", "\n", "1. Sort the DataFrame so that rows are in descending order of `%`\n", "2. Group by `Party` and select the first row of each sub-DataFrame\n", "\n", "While it may seem unintuitive, sorting `elections` by descending order of `%` is extremely helpful. 
If we then group by `Party`, the first row of each sub-DataFrame will contain information about the `Candidate` with the highest voter `%`.\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
1141964Lyndon JohnsonDemocratic43127041win61.344703
911936Franklin RooseveltDemocratic27752648win60.978107
1201972Richard NixonRepublican47168710win60.907806
791920Warren HardingRepublican16144093win60.574501
1331984Ronald ReaganRepublican54455472win59.023326
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result %\n", "114 1964 Lyndon Johnson Democratic 43127041 win 61.344703\n", "91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107\n", "120 1972 Richard Nixon Republican 47168710 win 60.907806\n", "79 1920 Warren Harding Republican 16144093 win 60.574501\n", "133 1984 Ronald Reagan Republican 54455472 win 59.023326" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections_sorted_by_percent = elections.sort_values(\"%\", ascending=False)\n", "elections_sorted_by_percent.head(5)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePopular voteResult%
Party
American1856Millard Fillmore873053loss21.554001
American Independent1968George Wallace9901118loss13.571218
Anti-Masonic1832William Wirt100715loss7.821583
Anti-Monopoly1884Benjamin Butler134294loss1.335838
Citizens1980Barry Commoner233052loss0.270182
Communist1932William Z. Foster103307loss0.261069
Constitution2008Chuck Baldwin199750loss0.152398
Constitutional Union1860John Bell590901loss12.639283
Democratic1964Lyndon Johnson43127041win61.344703
Democratic-Republican1824Andrew Jackson151271loss57.210122
\n", "
" ], "text/plain": [ " Year Candidate Popular vote Result %\n", "Party \n", "American 1856 Millard Fillmore 873053 loss 21.554001\n", "American Independent 1968 George Wallace 9901118 loss 13.571218\n", "Anti-Masonic 1832 William Wirt 100715 loss 7.821583\n", "Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838\n", "Citizens 1980 Barry Commoner 233052 loss 0.270182\n", "Communist 1932 William Z. Foster 103307 loss 0.261069\n", "Constitution 2008 Chuck Baldwin 199750 loss 0.152398\n", "Constitutional Union 1860 John Bell 590901 loss 12.639283\n", "Democratic 1964 Lyndon Johnson 43127041 win 61.344703\n", "Democratic-Republican 1824 Andrew Jackson 151271 loss 57.210122" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections_sorted_by_percent.groupby(\"Party\").agg(lambda x : x.iloc[0]).head(10)\n", "\n", "# Equivalent to the below code\n", "# elections_sorted_by_percent.groupby(\"Party\").agg('first').head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustration of the process:\n", "\n", "groupby_demo\n", "\n", "Notice how our code correctly determines that Lyndon Johnson from the Democratic Party has the highest voter `%`.\n", "\n", "More generally, `lambda` functions are used to design custom aggregation functions that aren't pre-defined by Python. The input parameter `x` to the `lambda` function is a `GroupBy` object. Therefore, it should make sense why `lambda x : x.iloc[0]` selects the first row in each groupby object.\n", "\n", "\n", "In fact, there's a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We've given a few examples below. \n", "\n", "**Note**: Understanding these alternative solutions is not required. 
They are given to demonstrate the vast number of problem-solving approaches in `pandas`.\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
221856Millard FillmoreAmerican873053loss21.554001
1151968George WallaceAmerican Independent9901118loss13.571218
61832William WirtAnti-Masonic100715loss7.821583
381884Benjamin ButlerAnti-Monopoly134294loss1.335838
1271980Barry CommonerCitizens233052loss0.270182
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "22 1856 Millard Fillmore American 873053 loss \n", "115 1968 George Wallace American Independent 9901118 loss \n", "6 1832 William Wirt Anti-Masonic 100715 loss \n", "38 1884 Benjamin Butler Anti-Monopoly 134294 loss \n", "127 1980 Barry Commoner Citizens 233052 loss \n", "\n", " % \n", "22 21.554001 \n", "115 13.571218 \n", "6 7.821583 \n", "38 1.335838 \n", "127 0.270182 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using the idxmax function\n", "best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]\n", "best_per_party.head(5)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
1481996John HagelinNatural Law113670loss0.118219
1642008Chuck BaldwinConstitution199750loss0.152398
1101956T. Coleman AndrewsStates' Rights107929loss0.174883
1471996Howard PhillipsTaxpayers184656loss0.192045
1361988Lenora FulaniNew Alliance217221loss0.237804
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result %\n", "148 1996 John Hagelin Natural Law 113670 loss 0.118219\n", "164 2008 Chuck Baldwin Constitution 199750 loss 0.152398\n", "110 1956 T. Coleman Andrews States' Rights 107929 loss 0.174883\n", "147 1996 Howard Phillips Taxpayers 184656 loss 0.192045\n", "136 1988 Lenora Fulani New Alliance 217221 loss 0.237804" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using the .drop_duplicates function\n", "best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')\n", "best_per_party2.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other `GroupBy` Features\n", "\n", "There are many aggregation methods we can use with `.agg`. Some useful options are:\n", "\n", "* [`.mean`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html#pandas.core.groupby.DataFrameGroupBy.mean): creates a new DataFrame with the mean value of each group\n", "* [`.sum`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html#pandas.core.groupby.DataFrameGroupBy.sum): creates a new DataFrame with the sum of each group\n", "* [`.max`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html#pandas.core.groupby.DataFrameGroupBy.max) and [`.min`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html#pandas.core.groupby.DataFrameGroupBy.min): creates a new DataFrame with the maximum/minimum value of each group\n", "* [`.first`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html#pandas.core.groupby.DataFrameGroupBy.first) and [`.last`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html#pandas.core.groupby.DataFrameGroupBy.last): creates a new DataFrame with the first/last row in each group\n", "* 
[`.size`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html#pandas.core.groupby.DataFrameGroupBy.size): creates a new **Series** with the number of entries in each group\n", "* [`.count`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html#pandas.core.groupby.DataFrameGroupBy.count): creates a new **DataFrame** with the number of entries, excluding missing values. \n", "\n", "Note the slight difference between `.size()` and `.count()`: while `.size()` returns a Series and counts the number of entries including the missing values, `.count()` returns a DataFrame and counts the number of entries in each column excluding missing values. Here's an example:\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumstate
0A1.0None
1A2.0tx
2B3.0fl
3C4.0hi
4CNaNNone
5C4.0ak
\n", "
" ], "text/plain": [ " letter num state\n", "0 A 1.0 None\n", "1 A 2.0 tx\n", "2 B 3.0 fl\n", "3 C 4.0 hi\n", "4 C NaN None\n", "5 C 4.0 ak" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'letter':['A','A','B','C','C','C'], \n", " 'num':[1,2,3,4,None,4], \n", " 'state':[None, 'tx', 'fl', 'hi', None, 'ak']})\n", "df" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "letter\n", "A 2\n", "B 1\n", "C 3\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"letter\").size()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numstate
letter
A21
B11
C22
\n", "
" ], "text/plain": [ " num state\n", "letter \n", "A 2 1\n", "B 1 1\n", "C 2 2" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"letter\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might recall that the `value_counts()` function in the previous note does something similar. It turns out `value_counts()` and `groupby.size()` are the same, except `value_counts()` sorts the resulting Series in descending order automatically. " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "C 3\n", "A 2\n", "B 1\n", "Name: letter, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"letter\"].value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "hese (and other) aggregation functions are so common that `pandas` allows for writing shorthand. Instead of explicitly stating the use of `.agg`, we can call the function directly on the `GroupBy` object.\n", "\n", "For example, the following are equivalent:\n", "\n", "- `elections.groupby(\"Candidate\").agg(mean)` \n", "- `elections.groupby(\"Candidate\").mean()`\n", "\n", "There are many other methods that `pandas` supports. You can check them out on the [`pandas` documentation](https://pandas.pydata.org/docs/reference/groupby.html).\n", "\n", "\n", "\n", "
\n", "\n", "groupby_demo\n", "\n", "### Filtering by Group\n", "\n", "Another common use for `GroupBy` objects is to filter data by group. \n", "\n", "`groupby.filter` takes an argument $\\text{f}$, where $\\text{f}$ is a function that:\n", "\n", "- Takes a DataFrame object as input\n", "- Returns a single `True` or `False` for the each sub-DataFrame\n", "\n", "Sub-DataFrames that correspond to `True` are returned in the final result, whereas those with a `False` value are not. Importantly, `groupby.filter` is different from `groupby.agg` in that an *entire* sub-DataFrame is returned in the final DataFrame, not just a single row. As a result, `groupby.filter` preserves the original indices.\n", "\n", "To illustrate how this happens, consider the following `.filter` function applied on some arbitrary data. Say we want to identify \"tight\" election years – that is, we want to find all rows that correspond to elections years where all candidates in that year won a similar portion of the total vote. Specifically, let's find all rows corresponding to a year where no candidate won more than 45% of the total vote. \n", "\n", "In other words, we want to: \n", "\n", "- Find the years where the maximum `%` in that year is less than 45%\n", "- Return all DataFrame rows that correspond to these years\n", "\n", "For each year, we need to find the maximum `%` among *all* rows for that year. If this maximum `%` is lower than 45%, we will tell `pandas` to keep all rows corresponding to that year. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
231860Abraham LincolnRepublican1855993win39.699408
241860John BellConstitutional Union590901loss12.639283
251860John C. BreckinridgeSouthern Democratic848019loss18.138998
261860Stephen A. DouglasNorthern Democratic1380202loss29.522311
661912Eugene V. DebsSocialist901551loss6.004354
671912Eugene W. ChafinProhibition208156loss1.386325
681912Theodore RooseveltProgressive4122721loss27.457433
691912William TaftRepublican3486242loss23.218466
701912Woodrow WilsonDemocratic6296284win41.933422
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "23 1860 Abraham Lincoln Republican 1855993 win \n", "24 1860 John Bell Constitutional Union 590901 loss \n", "25 1860 John C. Breckinridge Southern Democratic 848019 loss \n", "26 1860 Stephen A. Douglas Northern Democratic 1380202 loss \n", "66 1912 Eugene V. Debs Socialist 901551 loss \n", "67 1912 Eugene W. Chafin Prohibition 208156 loss \n", "68 1912 Theodore Roosevelt Progressive 4122721 loss \n", "69 1912 William Taft Republican 3486242 loss \n", "70 1912 Woodrow Wilson Democratic 6296284 win \n", "\n", " % \n", "23 39.699408 \n", "24 12.639283 \n", "25 18.138998 \n", "26 29.522311 \n", "66 6.004354 \n", "67 1.386325 \n", "68 27.457433 \n", "69 23.218466 \n", "70 41.933422 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.groupby(\"Year\").filter(lambda sf: sf[\"%\"].max() < 45).head(9)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "What's going on here? In this example, we've defined our filtering function, $\\text{f}$, to be `lambda sf: sf[\"%\"].max() < 45`. This filtering function will find the maximum `\"%\"` value among all entries in the grouped sub-DataFrame, which we call `sf`. If the maximum value is less than 45, then the filter function will return `True` and all rows in that grouped sub-DataFrame will appear in the final output DataFrame. \n", "\n", "Examine the DataFrame above. Notice how, in this preview of the first 9 rows, all entries from the years 1860 and 1912 appear. This means that in 1860 and 1912, no candidate in that year won more than 45% of the total vote. \n", "\n", "You may ask: how is the `groupby.filter` procedure different to the boolean filtering we've seen previously? Boolean filtering considers *individual* rows when applying a boolean condition. 
For example, the code `elections[elections[\"%\"] < 45]` will check the `\"%\"` value of every single row in `elections`; if it is less than 45, then that row will be kept in the output. `groupby.filter`, in contrast, applies a boolean condition to each *group* of rows as a whole. If the filter function returns `False` for a group, every row in that group is discarded from the output. \n", "\n", "## Aggregation with `.pivot_table`\n", "\n", "We know now that `.groupby` gives us the ability to group and aggregate data across our DataFrame. The examples above formed groups using just one column in the DataFrame. It's possible to group by multiple columns at once by passing in a list of column names to `.groupby`. \n", "\n", "\n", "Let's consider the `names` dataset. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we'll group by *both* the `\"Year\"` and `\"Sex\"` columns.\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYearFirst Letter
0MaryF70651880M
1AnnaF26041880A
2EmmaF20031880E
3ElizabethF19391880E
4MinnieF17461880M
\n", "
" ], "text/plain": [ " Name Sex Count Year First Letter\n", "0 Mary F 7065 1880 M\n", "1 Anna F 2604 1880 A\n", "2 Emma F 2003 1880 E\n", "3 Elizabeth F 1939 1880 E\n", "4 Minnie F 1746 1880 M" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.head()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
YearSex
1880F90994
M110490
1881F91953
M100737
1882F107847
M113686
\n", "
" ], "text/plain": [ " Count\n", "Year Sex \n", "1880 F 90994\n", " M 110490\n", "1881 F 91953\n", " M 100737\n", "1882 F 107847\n", " M 113686" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the total number of baby names associated with each sex for each year in the data\n", "names.groupby([\"Year\", \"Sex\"])[[\"Count\"]].sum().head(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that both `\"Year\"` and `\"Sex\"` serve as the index of the DataFrame (they are both rendered in bold). We've created a *multi-index* DataFrame where two different index values, the year and sex, are used to uniquely identify each row. \n", "\n", "This isn't the most intuitive way of representing this data – and, because multi-indexed DataFrames have multiple dimensions in their index, they can often be difficult to use. \n", "\n", "Another strategy to aggregate across two columns is to create a pivot table. One set of values is used to create the index of the pivot table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.\n", "\n", "The best way to understand pivot tables is to see one in action. Let's return to our original goal of summing the total number of names associated with each combination of year and sex. We'll call the `pandas` [`.pivot_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) method to create a new table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SexFM
Year
188090994110490
188191953100737
1882107847113686
1883112319104625
1884129019114442
\n", "
" ], "text/plain": [ "Sex F M\n", "Year \n", "1880 90994 110490\n", "1881 91953 100737\n", "1882 107847 113686\n", "1883 112319 104625\n", "1884 129019 114442" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `pivot_table` method is used to generate a Pandas pivot table\n", "names.pivot_table(\n", " index = \"Year\", \n", " columns = \"Sex\", \n", " values = \"Count\", \n", " aggfunc = sum).head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Looks a lot better! Now, our DataFrame is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of `\"Year\"` and `\"Sex\"`.\n", "\n", "Let's take a closer look at the code implemented above. \n", "\n", "* `index = \"Year\"` specifies the column name in the original DataFrame that should be used as the index of the pivot table\n", "* `columns = \"Sex\"` specifies the column name in the original DataFrame that should be used to generate the columns of the pivot table\n", "* `values = \"Count\"` indicates what values from the original DataFrame should be used to populate the entry for each index-column combination\n", "* `aggfunc = sum` tells `pandas` what function to use when aggregating the data specified by `values`. Here, we are summing the name counts for each pair of `\"Year\"` and `\"Sex\"`\n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/pivot.png\n", "---\n", "height: 400px\n", "name: pivot_table\n", "---\n", "A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation function. A pivot table is itself a DataFrame, where the rows represent one variable that you're interested in, the columns another, and the cell's some aggregate value. 
A pivot table also tends to include marginal values (like sums) for each row and column.\n", "```\n", "\n", "We can even include multiple values in the index or columns of our pivot tables." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountName
SexFMFM
Year
188070659655ZulaZeke
188169198769ZulaZeb
188281489557ZulaZed
188380128894ZulaZeno
188492179388ZulaZollie
188591288756ZulaZollie
\n", "
" ], "text/plain": [ " Count Name \n", "Sex F M F M\n", "Year \n", "1880 7065 9655 Zula Zeke\n", "1881 6919 8769 Zula Zeb\n", "1882 8148 9557 Zula Zed\n", "1883 8012 8894 Zula Zeno\n", "1884 9217 9388 Zula Zollie\n", "1885 9128 8756 Zula Zollie" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names_pivot = names.pivot_table(\n", " index=\"Year\", # the rows (turned into index)\n", " columns=\"Sex\", # the column values\n", " values=[\"Count\", \"Name\"], \n", " aggfunc=max, # group operation\n", ")\n", "names_pivot.head(6)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }