{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas IV: Aggregation\n", "\n", "In this notebook, we will learn how to aggregate data using pandas. This generally entails grouping data by a certain column's values and then applying a function to the groups." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation with `.groupby`\n", "\n", "Up until this point, we have been working with individual rows of DataFrames. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame. To do this, we’ll use pandas GroupBy objects.\n", "\n", "``` {figure} https://pandas.pydata.org/docs/_images/06_groupby.svg\n", "---\n", "name: groupby\n", "align: center\n", "width: 95%\n", "---\n", "GroupBy operation broken down into split-apply-combine steps.\n", "```\n", "\n", "``` {figure} https://pandas.pydata.org/docs/_images/06_valuecounts.svg\n", "---\n", "name: valuecounts\n", "align: center\n", "width: 95%\n", "---\n", "Value counts for a single column.\n", "```\n", "\n", "``` {figure} https://i.ibb.co/y0P3GXq/groupby1.png\n", "---\n", "name: groupby_select_detail\n", "align: center\n", "width: 95%\n", "---\n", "Detail of the split-apply-combine operation.\n", "```\n", "\n", "\n", "A groupby operation involves some combination of **splitting a DataFrame into grouped subframes**, **applying a function**, and **combining the results**.\n", "\n", "For some arbitrary DataFrame `df` below, the code `df.groupby(\"year\").sum()` does the following:\n", "\n", "- **Splits** the DataFrame into sub-DataFrames with rows belonging to the same year.\n", "- **Applies** the `sum` function to each column of each sub-DataFrame.\n", "- **Combines** the results of `sum` into a single DataFrame, indexed by `year`.\n", "\n", "
groupby_demo
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Let's say we had baby names for all years in a single DataFrame `names`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYear
0MaryF70651880
1AnnaF26041880
2EmmaF20031880
3ElizabethF19391880
4MinnieF17461880
...............
31677ZyellM52023
31678ZyenM52023
31679ZymirrM52023
31680ZyquanM52023
31681ZyrinM52023
\n", "

2117219 rows × 4 columns

\n", "
" ], "text/plain": [ " Name Sex Count Year\n", "0 Mary F 7065 1880\n", "1 Anna F 2604 1880\n", "2 Emma F 2003 1880\n", "3 Elizabeth F 1939 1880\n", "4 Minnie F 1746 1880\n", "... ... .. ... ...\n", "31677 Zyell M 5 2023\n", "31678 Zyen M 5 2023\n", "31679 Zymirr M 5 2023\n", "31680 Zyquan M 5 2023\n", "31681 Zyrin M 5 2023\n", "\n", "[2117219 rows x 4 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import urllib.request\n", "import os.path\n", "import pandas as pd \n", "\n", "# Download data from the web directly\n", "data_url = \"https://www.ssa.gov/oact/babynames/names.zip\"\n", "local_filename = \"../data/names.zip\"\n", "if not os.path.exists(local_filename): # if the data exists don't download again\n", " with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:\n", " f.write(resp.read())\n", "\n", " \n", "# Load data without unzipping the file\n", "import zipfile\n", "names = [] \n", "with zipfile.ZipFile(local_filename, \"r\") as zf:\n", " data_files = [f for f in zf.filelist if f.filename[-3:] == \"txt\"]\n", " def extract_year_from_filename(fn):\n", " return int(fn[3:7])\n", " for f in data_files:\n", " year = extract_year_from_filename(f.filename)\n", " with zf.open(f) as fp:\n", " df = pd.read_csv(fp, names=[\"Name\", \"Sex\", \"Count\"])\n", " df[\"Year\"] = year\n", " names.append(df)\n", "names = pd.concat(names)\n", "\n", "names" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "names.to_csv(\"../data/names.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we wanted to aggregate all rows in `names` for a given year, we would need `names.groupby(\"Year\")`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.groupby(\"Year\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "What does this strange output mean? Calling [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) has generated a `GroupBy` object. You can imagine this as a set of \"mini\" sub-DataFrames, where each subframe contains all of the rows from `names` that correspond to a particular year. \n", "\n", "The diagram below shows a simplified view of `names` to help illustrate this idea.\n", "\n", "![Creating a GroupBy object](../assets/gb.png)\n", "\n", "We can't work with a `GroupBy` object directly – that is why you saw that strange output earlier, rather than a standard view of a DataFrame. To actually manipulate values within these \"mini\" DataFrames, we'll need to call an *aggregation method*. This is a method that tells `pandas` how to aggregate the values within the `GroupBy` object. Once the aggregation is applied, `pandas` will return a normal (now grouped) DataFrame.\n", "\n", "Aggregation functions (`.min()`, `.max()`, `.mean()`, `.sum()`, etc.) are the most common way to work with `GroupBy` objects. These functions are applied to each column of a \"mini\" grouped DataFrame. We end up with a new DataFrame with one aggregated row per subframe. Let's see this in action by finding the `sum` of all counts for each year in `names` – this is equivalent to finding the number of babies born in each year. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
1880201484
1881192690
1882221533
1883216944
1884243461
\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 201484\n", "1881 192690\n", "1882 221533\n", "1883 216944\n", "1884 243461" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.groupby(\"Year\").sum().head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of `names`, which is why we see smaller values for the summed counts.\n", "\n", "![Performing an aggregation](https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/agg.png)\n", "\n", "Calling `.agg` has condensed each subframe back into a single row. This gives us our final output: a DataFrame that is now indexed by `\"Year\"`, with a single row for each unique year in the original `names` DataFrame.\n", "\n", "You may be wondering: where did the `\"State\"`, `\"Sex\"`, and `\"Name\"` columns go? Logically, it doesn't make sense to `sum` the string data in these columns (how would we add \"Mary\" + \"Ann\"?). Because of this, `pandas` will simply omit these columns when it performs the aggregation on the DataFrame. Since this happens implicitly, without the user specifying that these columns should be ignored, it's easy to run into troubling situations where columns are removed without the programmer noticing. It is better coding practice to select *only* the columns we care about before performing the aggregation.\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Year
1880201484
1881192690
1882221533
1883216944
1884243461
\n", "
" ], "text/plain": [ " Count\n", "Year \n", "1880 201484\n", "1881 192690\n", "1882 221533\n", "1883 216944\n", "1884 243461" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same result, but now we explicitly tell pandas to only consider the \"Count\" column when summing\n", "names.groupby(\"Year\")[[\"Count\"]].sum().head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many different aggregations that can be applied to the grouped data. The primary requirement is that an aggregation function must:\n", "\n", "* Take in a `Series` of data (a single column of the grouped subframe)\n", "* Return a single value that aggregates this `Series`\n", "\n", "Because of this fairly broad requirement, `pandas` offers many ways of computing an aggregation.\n", "\n", "**In-built** Python operations – such as `sum`, `max`, and `min` – are automatically recognized by `pandas`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban16
Aabha9
Aabid6
Aabidah5
Aabir5
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 16\n", "Aabha 9\n", "Aabid 6\n", "Aabidah 5\n", "Aabir 5" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the maximum count for each name in any year?\n", "names.groupby(\"Name\")[[\"Count\"]].max().head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban5
Aabha5
Aabid5
Aabidah5
Aabir5
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 5\n", "Aabha 5\n", "Aabid 5\n", "Aabidah 5\n", "Aabir 5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the minimum count for each name in any year?\n", "names.groupby(\"Name\")[[\"Count\"]].min().head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
Name
Aaban10.000000
Aabha6.375000
Aabid5.333333
Aabidah5.000000
Aabir5.000000
\n", "
" ], "text/plain": [ " Count\n", "Name \n", "Aaban 10.000000\n", "Aabha 6.375000\n", "Aabid 5.333333\n", "Aabidah 5.000000\n", "Aabir 5.000000" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What is the average count for each name across all years?\n", "names.groupby(\"Name\")[[\"Count\"]].mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` also offers a number of in-built functions for aggregation. Some examples include:\n", "\n", "* `.sum()`\n", "* `.max()`\n", "* `.min()`\n", "* `.mean()`\n", "* `.first()`\n", "* `.last()`\n", "\n", "The latter two entries in this list – `\"first\"` and `\"last\"` – are unique to `pandas`. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where *multiple* columns in a group share identical information. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.\n", "\n", "Let's illustrate this with an example. Say we add a new column to `names` that contains the first letter of each name. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameFirst LetterYear
0MaryM1880
1AnnaA1880
2EmmaE1880
3ElizabethE1880
4MinnieM1880
\n", "
" ], "text/plain": [ " Name First Letter Year\n", "0 Mary M 1880\n", "1 Anna A 1880\n", "2 Emma E 1880\n", "3 Elizabeth E 1880\n", "4 Minnie M 1880" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Imagine we had an additional column, \"First Letter\". We'll explain this code next week\n", "names[\"First Letter\"] = names[\"Name\"].apply(lambda x: x[0])\n", "\n", "# We construct a simplified DataFrame containing just a subset of columns\n", "names_new = names[[\"Name\", \"First Letter\", \"Year\"]]\n", "names_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we form groups for each name in the dataset, `\"First Letter\"` will be the same for all members of the group. This means that if we simply select the first entry for `\"First Letter\"` in the group, we'll represent all data in that group. \n", "\n", "We can use a dictionary to apply different aggregation functions to each column during grouping.\n", "\n", "![Aggregating using \"first\"](../assets/first.png)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
First LetterYear
Name
AabanA2019
AabhaA2021
AabidA2018
AabidahA2018
AabirA2018
\n", "
" ], "text/plain": [ " First Letter Year\n", "Name \n", "Aaban A 2019\n", "Aabha A 2021\n", "Aabid A 2018\n", "Aabidah A 2018\n", "Aabir A 2018" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names_new.groupby(\"Name\").agg({\"First Letter\":\"first\", \"Year\":\"max\"}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also define aggregation functions of our own! This can be done using either a `def` or `lambda` statement. Again, the condition for a custom aggregation function is that it must take in a `Series` and output a single scalar value." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'names' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Input \u001b[0;32mIn [2]\u001b[0m, in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21mratio_to_peak\u001b[39m(series):\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m series\u001b[38;5;241m.\u001b[39miloc[\u001b[38;5;241m-\u001b[39m\u001b[38;5;241m1\u001b[39m]\u001b[38;5;241m/\u001b[39m\u001b[38;5;28mmax\u001b[39m(series)\n\u001b[0;32m----> 4\u001b[0m \u001b[43mnames\u001b[49m\u001b[38;5;241m.\u001b[39mgroupby(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mName\u001b[39m\u001b[38;5;124m\"\u001b[39m)[[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mYear\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCount\u001b[39m\u001b[38;5;124m\"\u001b[39m]]\u001b[38;5;241m.\u001b[39mapply(ratio_to_peak)\n", "\u001b[0;31mNameError\u001b[0m: name 'names' is not defined" ] } ], "source": [ "def ratio_to_peak(series):\n", " return series.iloc[-1]/max(series)\n", "\n", "names.groupby(\"Name\")[[\"Year\", 
\"Count\"]].apply(ratio_to_peak)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "`lambda` functions are a special type of function that can be defined in a single line. They are also often refered to as _\"anonymous\"_ functions because these functions don't have a name. They are useful for simple functions that are not used elsewhere in your code.\n", "```" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCount
Name
Aaban1.00.375000
Aabha1.00.555556
Aabid1.01.000000
Aabidah1.01.000000
Aabir1.01.000000
.........
Zyvion1.01.000000
Zyvon1.01.000000
Zyyanna1.01.000000
Zyyon1.01.000000
Zzyzx1.01.000000
\n", "

101338 rows × 2 columns

\n", "
" ], "text/plain": [ " Year Count\n", "Name \n", "Aaban 1.0 0.375000\n", "Aabha 1.0 0.555556\n", "Aabid 1.0 1.000000\n", "Aabidah 1.0 1.000000\n", "Aabir 1.0 1.000000\n", "... ... ...\n", "Zyvion 1.0 1.000000\n", "Zyvon 1.0 1.000000\n", "Zyyanna 1.0 1.000000\n", "Zyyon 1.0 1.000000\n", "Zzyzx 1.0 1.000000\n", "\n", "[101338 rows x 2 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternatively, using lambda\n", "names.groupby(\"Name\")[[\"Year\", \"Count\"]].agg(lambda s: s.iloc[-1]/max(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation with `lambda` Functions\n", "\n", "We'll work with the `elections` DataFrame again." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
01824Andrew JacksonDemocratic-Republican151271loss57.210122
11824John Quincy AdamsDemocratic-Republican113142win42.789878
21828Andrew JacksonDemocratic642806win56.203927
31828John Quincy AdamsNational Republican500897loss43.796073
41832Andrew JacksonDemocratic702735win54.574789
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "0 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "1 1824 John Quincy Adams Democratic-Republican 113142 win \n", "2 1828 Andrew Jackson Democratic 642806 win \n", "3 1828 John Quincy Adams National Republican 500897 loss \n", "4 1832 Andrew Jackson Democratic 702735 win \n", "\n", " % \n", "0 57.210122 \n", "1 42.789878 \n", "2 56.203927 \n", "3 43.796073 \n", "4 54.574789 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "url = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv\"\n", "elections = pd.read_csv(url)\n", "elections.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wish to aggregate our DataFrame using a non-standard function – for example, a function of our own design? We can do so by combining `.agg` with `lambda` expressions.\n", "\n", "Let's first consider a puzzle to jog our memory. We will attempt to find the `Candidate` from each `Party` with the highest `%` of votes. \n", "\n", "A naive approach may be to group by the `Party` column and aggregate by the maximum." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePopular voteResult%
Party
American1976Thomas J. Anderson873053loss21.554001
American Independent1976Lester Maddox9901118loss13.571218
Anti-Masonic1832William Wirt100715loss7.821583
Anti-Monopoly1884Benjamin Butler134294loss1.335838
Citizens1980Barry Commoner233052loss0.270182
Communist1932William Z. Foster103307loss0.261069
Constitution2016Michael Peroutka203091loss0.152398
Constitutional Union1860John Bell590901loss12.639283
Democratic2020Woodrow Wilson81268924win61.344703
Democratic-Republican1824John Quincy Adams151271win57.210122
\n", "
" ], "text/plain": [ " Year Candidate Popular vote Result \\\n", "Party \n", "American 1976 Thomas J. Anderson 873053 loss \n", "American Independent 1976 Lester Maddox 9901118 loss \n", "Anti-Masonic 1832 William Wirt 100715 loss \n", "Anti-Monopoly 1884 Benjamin Butler 134294 loss \n", "Citizens 1980 Barry Commoner 233052 loss \n", "Communist 1932 William Z. Foster 103307 loss \n", "Constitution 2016 Michael Peroutka 203091 loss \n", "Constitutional Union 1860 John Bell 590901 loss \n", "Democratic 2020 Woodrow Wilson 81268924 win \n", "Democratic-Republican 1824 John Quincy Adams 151271 win \n", "\n", " % \n", "Party \n", "American 21.554001 \n", "American Independent 13.571218 \n", "Anti-Masonic 7.821583 \n", "Anti-Monopoly 1.335838 \n", "Citizens 0.270182 \n", "Communist 0.261069 \n", "Constitution 0.152398 \n", "Constitutional Union 12.639283 \n", "Democratic 61.344703 \n", "Democratic-Republican 57.210122 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.groupby(\"Party\").agg(max).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach is clearly wrong – the DataFrame claims that Woodrow Wilson won the presidency in 2020.\n", "\n", "Why is this happening? Here, the `max` aggregation function is taken over every column *independently*. Among Democrats, `max` is computing:\n", "\n", "- The most recent `Year` a Democratic candidate ran for president (2020)\n", "- The `Candidate` with the alphabetically \"largest\" name (\"Woodrow Wilson\")\n", "- The `Result` with the alphabetically \"largest\" outcome (\"win\")\n", "\n", "Instead, let's try a different approach. We will:\n", "\n", "1. Sort the DataFrame so that rows are in descending order of `%`\n", "2. Group by `Party` and select the first row of each sub-DataFrame\n", "\n", "While it may seem unintuitive, sorting `elections` by descending order of `%` is extremely helpful. 
If we then group by `Party`, the first row of each sub-DataFrame will contain information about the `Candidate` with the highest voter `%`.\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
1141964Lyndon JohnsonDemocratic43127041win61.344703
911936Franklin RooseveltDemocratic27752648win60.978107
1201972Richard NixonRepublican47168710win60.907806
791920Warren HardingRepublican16144093win60.574501
1331984Ronald ReaganRepublican54455472win59.023326
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result %\n", "114 1964 Lyndon Johnson Democratic 43127041 win 61.344703\n", "91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107\n", "120 1972 Richard Nixon Republican 47168710 win 60.907806\n", "79 1920 Warren Harding Republican 16144093 win 60.574501\n", "133 1984 Ronald Reagan Republican 54455472 win 59.023326" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections_sorted_by_percent = elections.sort_values(\"%\", ascending=False)\n", "elections_sorted_by_percent.head(5)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePopular voteResult%
Party
American1856Millard Fillmore873053loss21.554001
American Independent1968George Wallace9901118loss13.571218
Anti-Masonic1832William Wirt100715loss7.821583
Anti-Monopoly1884Benjamin Butler134294loss1.335838
Citizens1980Barry Commoner233052loss0.270182
Communist1932William Z. Foster103307loss0.261069
Constitution2008Chuck Baldwin199750loss0.152398
Constitutional Union1860John Bell590901loss12.639283
Democratic1964Lyndon Johnson43127041win61.344703
Democratic-Republican1824Andrew Jackson151271loss57.210122
\n", "
" ], "text/plain": [ " Year Candidate Popular vote Result %\n", "Party \n", "American 1856 Millard Fillmore 873053 loss 21.554001\n", "American Independent 1968 George Wallace 9901118 loss 13.571218\n", "Anti-Masonic 1832 William Wirt 100715 loss 7.821583\n", "Anti-Monopoly 1884 Benjamin Butler 134294 loss 1.335838\n", "Citizens 1980 Barry Commoner 233052 loss 0.270182\n", "Communist 1932 William Z. Foster 103307 loss 0.261069\n", "Constitution 2008 Chuck Baldwin 199750 loss 0.152398\n", "Constitutional Union 1860 John Bell 590901 loss 12.639283\n", "Democratic 1964 Lyndon Johnson 43127041 win 61.344703\n", "Democratic-Republican 1824 Andrew Jackson 151271 loss 57.210122" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections_sorted_by_percent.groupby(\"Party\").agg(lambda x : x.iloc[0]).head(10)\n", "\n", "# Equivalent to the below code\n", "# elections_sorted_by_percent.groupby(\"Party\").agg('first').head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustration of the process:\n", "\n", "groupby_demo\n", "\n", "Notice how our code correctly determines that Lyndon Johnson from the Democratic Party has the highest voter `%`.\n", "\n", "More generally, `lambda` functions are used to design custom aggregation functions that aren't pre-defined by Python. The input parameter `x` to the `lambda` function is a `GroupBy` object. Therefore, it should make sense why `lambda x : x.iloc[0]` selects the first row in each groupby object.\n", "\n", "\n", "In fact, there's a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We've given a few examples below. \n", "\n", "**Note**: Understanding these alternative solutions is not required. 
They are given to demonstrate the vast number of problem-solving approaches in `pandas`.\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
221856Millard FillmoreAmerican873053loss21.554001
1151968George WallaceAmerican Independent9901118loss13.571218
61832William WirtAnti-Masonic100715loss7.821583
381884Benjamin ButlerAnti-Monopoly134294loss1.335838
1271980Barry CommonerCitizens233052loss0.270182
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "22 1856 Millard Fillmore American 873053 loss \n", "115 1968 George Wallace American Independent 9901118 loss \n", "6 1832 William Wirt Anti-Masonic 100715 loss \n", "38 1884 Benjamin Butler Anti-Monopoly 134294 loss \n", "127 1980 Barry Commoner Citizens 233052 loss \n", "\n", " % \n", "22 21.554001 \n", "115 13.571218 \n", "6 7.821583 \n", "38 1.335838 \n", "127 0.270182 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using the idxmax function\n", "best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]\n", "best_per_party.head(5)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
1481996John HagelinNatural Law113670loss0.118219
1642008Chuck BaldwinConstitution199750loss0.152398
1101956T. Coleman AndrewsStates' Rights107929loss0.174883
1471996Howard PhillipsTaxpayers184656loss0.192045
1361988Lenora FulaniNew Alliance217221loss0.237804
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result %\n", "148 1996 John Hagelin Natural Law 113670 loss 0.118219\n", "164 2008 Chuck Baldwin Constitution 199750 loss 0.152398\n", "110 1956 T. Coleman Andrews States' Rights 107929 loss 0.174883\n", "147 1996 Howard Phillips Taxpayers 184656 loss 0.192045\n", "136 1988 Lenora Fulani New Alliance 217221 loss 0.237804" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using the .drop_duplicates function\n", "best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')\n", "best_per_party2.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other `GroupBy` Features\n", "\n", "There are many aggregation methods we can use with `.agg`. Some useful options are:\n", "\n", "* [`.mean`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html#pandas.core.groupby.DataFrameGroupBy.mean): creates a new DataFrame with the mean value of each group\n", "* [`.sum`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html#pandas.core.groupby.DataFrameGroupBy.sum): creates a new DataFrame with the sum of each group\n", "* [`.max`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.max.html#pandas.core.groupby.DataFrameGroupBy.max) and [`.min`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.min.html#pandas.core.groupby.DataFrameGroupBy.min): creates a new DataFrame with the maximum/minimum value of each group\n", "* [`.first`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html#pandas.core.groupby.DataFrameGroupBy.first) and [`.last`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.last.html#pandas.core.groupby.DataFrameGroupBy.last): creates a new DataFrame with the first/last row in each group\n", "* 
[`.size`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.size.html#pandas.core.groupby.DataFrameGroupBy.size): creates a new **Series** with the number of entries in each group\n", "* [`.count`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html#pandas.core.groupby.DataFrameGroupBy.count): creates a new **DataFrame** with the number of entries, excluding missing values. \n", "\n", "Note the slight difference between `.size()` and `.count()`: while `.size()` returns a Series and counts the number of entries including the missing values, `.count()` returns a DataFrame and counts the number of entries in each column excluding missing values. Here's an example:\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
letternumstate
0A1.0None
1A2.0tx
2B3.0fl
3C4.0hi
4CNaNNone
5C4.0ak
\n", "
" ], "text/plain": [ " letter num state\n", "0 A 1.0 None\n", "1 A 2.0 tx\n", "2 B 3.0 fl\n", "3 C 4.0 hi\n", "4 C NaN None\n", "5 C 4.0 ak" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'letter':['A','A','B','C','C','C'], \n", " 'num':[1,2,3,4,None,4], \n", " 'state':[None, 'tx', 'fl', 'hi', None, 'ak']})\n", "df" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "letter\n", "A 2\n", "B 1\n", "C 3\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"letter\").size()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numstate
letter
A21
B11
C22
\n", "
" ], "text/plain": [ " num state\n", "letter \n", "A 2 1\n", "B 1 1\n", "C 2 2" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"letter\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might recall that the `value_counts()` function in the previous note does something similar. It turns out `value_counts()` and `groupby.size()` are the same, except `value_counts()` sorts the resulting Series in descending order automatically. " ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "C 3\n", "A 2\n", "B 1\n", "Name: letter, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"letter\"].value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "hese (and other) aggregation functions are so common that `pandas` allows for writing shorthand. Instead of explicitly stating the use of `.agg`, we can call the function directly on the `GroupBy` object.\n", "\n", "For example, the following are equivalent:\n", "\n", "- `elections.groupby(\"Candidate\").agg(mean)` \n", "- `elections.groupby(\"Candidate\").mean()`\n", "\n", "There are many other methods that `pandas` supports. You can check them out on the [`pandas` documentation](https://pandas.pydata.org/docs/reference/groupby.html).\n", "\n", "\n", "\n", "
\n", "\n", "groupby_demo\n", "\n", "### Filtering by Group\n", "\n", "Another common use for `GroupBy` objects is to filter data by group. \n", "\n", "`groupby.filter` takes an argument $\\text{f}$, where $\\text{f}$ is a function that:\n", "\n", "- Takes a DataFrame object as input\n", "- Returns a single `True` or `False` for the each sub-DataFrame\n", "\n", "Sub-DataFrames that correspond to `True` are returned in the final result, whereas those with a `False` value are not. Importantly, `groupby.filter` is different from `groupby.agg` in that an *entire* sub-DataFrame is returned in the final DataFrame, not just a single row. As a result, `groupby.filter` preserves the original indices.\n", "\n", "To illustrate how this happens, consider the following `.filter` function applied on some arbitrary data. Say we want to identify \"tight\" election years – that is, we want to find all rows that correspond to elections years where all candidates in that year won a similar portion of the total vote. Specifically, let's find all rows corresponding to a year where no candidate won more than 45% of the total vote. \n", "\n", "In other words, we want to: \n", "\n", "- Find the years where the maximum `%` in that year is less than 45%\n", "- Return all DataFrame rows that correspond to these years\n", "\n", "For each year, we need to find the maximum `%` among *all* rows for that year. If this maximum `%` is lower than 45%, we will tell `pandas` to keep all rows corresponding to that year. " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
231860Abraham LincolnRepublican1855993win39.699408
241860John BellConstitutional Union590901loss12.639283
251860John C. BreckinridgeSouthern Democratic848019loss18.138998
261860Stephen A. DouglasNorthern Democratic1380202loss29.522311
661912Eugene V. DebsSocialist901551loss6.004354
671912Eugene W. ChafinProhibition208156loss1.386325
681912Theodore RooseveltProgressive4122721loss27.457433
691912William TaftRepublican3486242loss23.218466
701912Woodrow WilsonDemocratic6296284win41.933422
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "23 1860 Abraham Lincoln Republican 1855993 win \n", "24 1860 John Bell Constitutional Union 590901 loss \n", "25 1860 John C. Breckinridge Southern Democratic 848019 loss \n", "26 1860 Stephen A. Douglas Northern Democratic 1380202 loss \n", "66 1912 Eugene V. Debs Socialist 901551 loss \n", "67 1912 Eugene W. Chafin Prohibition 208156 loss \n", "68 1912 Theodore Roosevelt Progressive 4122721 loss \n", "69 1912 William Taft Republican 3486242 loss \n", "70 1912 Woodrow Wilson Democratic 6296284 win \n", "\n", " % \n", "23 39.699408 \n", "24 12.639283 \n", "25 18.138998 \n", "26 29.522311 \n", "66 6.004354 \n", "67 1.386325 \n", "68 27.457433 \n", "69 23.218466 \n", "70 41.933422 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elections.groupby(\"Year\").filter(lambda sf: sf[\"%\"].max() < 45).head(9)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "What's going on here? In this example, we've defined our filtering function, $\\text{f}$, to be `lambda sf: sf[\"%\"].max() < 45`. This filtering function will find the maximum `\"%\"` value among all entries in the grouped sub-DataFrame, which we call `sf`. If the maximum value is less than 45, then the filter function will return `True` and all rows in that grouped sub-DataFrame will appear in the final output DataFrame. \n", "\n", "Examine the DataFrame above. Notice how, in this preview of the first 9 rows, all entries from the years 1860 and 1912 appear. This means that in 1860 and 1912, no candidate in that year won more than 45% of the total vote. \n", "\n", "You may ask: how is the `groupby.filter` procedure different to the boolean filtering we've seen previously? Boolean filtering considers *individual* rows when applying a boolean condition. 
For example, the code `elections[elections[\"%\"] < 45]` will check the `\"%\"` value of every single row in `elections`; if it is less than 45, then that row will be kept in the output. `groupby.filter`, in contrast, applies a boolean condition to each *group* of rows as a whole. If the filter function returns `False` for a group, every row in that group is discarded from the output. \n", "\n", "## Aggregation with `.pivot_table`\n", "\n", "We know now that `.groupby` gives us the ability to group and aggregate data across our DataFrame. The examples above formed groups using just one column in the DataFrame. It's possible to group by multiple columns at once by passing in a list of column names to `.groupby`. \n", "\n", "\n", "Let's consider the `names` dataset. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we'll group by *both* the `\"Year\"` and `\"Sex\"` columns.\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYearFirst Letter
0MaryF70651880M
1AnnaF26041880A
2EmmaF20031880E
3ElizabethF19391880E
4MinnieF17461880M
\n", "
" ], "text/plain": [ " Name Sex Count Year First Letter\n", "0 Mary F 7065 1880 M\n", "1 Anna F 2604 1880 A\n", "2 Emma F 2003 1880 E\n", "3 Elizabeth F 1939 1880 E\n", "4 Minnie F 1746 1880 M" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names.head()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Count
YearSex
1880F90994
M110490
1881F91953
M100737
1882F107847
M113686
\n", "
" ], "text/plain": [ " Count\n", "Year Sex \n", "1880 F 90994\n", " M 110490\n", "1881 F 91953\n", " M 100737\n", "1882 F 107847\n", " M 113686" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the total number of baby names associated with each sex for each year in the data\n", "names.groupby([\"Year\", \"Sex\"])[[\"Count\"]].sum().head(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that both `\"Year\"` and `\"Sex\"` serve as the index of the DataFrame (they are both rendered in bold). We've created a *multi-index* DataFrame where two different index values, the year and sex, are used to uniquely identify each row. \n", "\n", "This isn't the most intuitive way of representing this data – and, because multi-indexed DataFrames have multiple dimensions in their index, they can often be difficult to use. \n", "\n", "Another strategy to aggregate across two columns is to create a pivot table. One set of values is used to create the index of the pivot table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.\n", "\n", "The best way to understand pivot tables is to see one in action. Let's return to our original goal of summing the total number of names associated with each combination of year and sex. We'll call the `pandas` [`.pivot_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) method to create a new table." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SexFM
Year
188090994110490
188191953100737
1882107847113686
1883112319104625
1884129019114442
\n", "
" ], "text/plain": [ "Sex F M\n", "Year \n", "1880 90994 110490\n", "1881 91953 100737\n", "1882 107847 113686\n", "1883 112319 104625\n", "1884 129019 114442" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The `pivot_table` method is used to generate a Pandas pivot table\n", "names.pivot_table(\n", " index = \"Year\", \n", " columns = \"Sex\", \n", " values = \"Count\", \n", " aggfunc = sum).head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Looks a lot better! Now, our DataFrame is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of `\"Year\"` and `\"Sex\"`.\n", "\n", "Let's take a closer look at the code implemented above. \n", "\n", "* `index = \"Year\"` specifies the column name in the original DataFrame that should be used as the index of the pivot table\n", "* `columns = \"Sex\"` specifies the column name in the original DataFrame that should be used to generate the columns of the pivot table\n", "* `values = \"Count\"` indicates what values from the original DataFrame should be used to populate the entry for each index-column combination\n", "* `aggfunc = sum` tells `pandas` what function to use when aggregating the data specified by `values`. Here, we are summing the name counts for each pair of `\"Year\"` and `\"Sex\"`\n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/pivot.png\n", "---\n", "height: 400px\n", "name: pivot_table\n", "---\n", "A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of the aggregation function. A pivot table is itself a DataFrame, where the rows represent one variable that you're interested in, the columns another, and the cell's some aggregate value. 
A pivot table also tends to include marginal values (like sums) for each row and column.\n", "```\n", "\n", "We can even include multiple values in the index or columns of our pivot tables." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CountName
SexFMFM
Year
188070659655ZulaZeke
188169198769ZulaZeb
188281489557ZulaZed
188380128894ZulaZeno
188492179388ZulaZollie
188591288756ZulaZollie
\n", "
" ], "text/plain": [ " Count Name \n", "Sex F M F M\n", "Year \n", "1880 7065 9655 Zula Zeke\n", "1881 6919 8769 Zula Zeb\n", "1882 8148 9557 Zula Zed\n", "1883 8012 8894 Zula Zeno\n", "1884 9217 9388 Zula Zollie\n", "1885 9128 8756 Zula Zollie" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names_pivot = names.pivot_table(\n", " index=\"Year\", # the rows (turned into index)\n", " columns=\"Sex\", # the column values\n", " values=[\"Count\", \"Name\"], \n", " aggfunc=max, # group operation\n", ")\n", "names_pivot.head(6)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }