{
"cells": [
{
"cell_type": "markdown",
"id": "620343fd",
"metadata": {},
"source": [
"# Pandas V: Concatenation and Merging\n"
]
},
{
"cell_type": "markdown",
"id": "46424189",
"metadata": {},
"source": [
"## Concatenating DataFrames\n",
"\n",
"Another way to combine DataFrames is to concatenate them. Concatenation is a bit different from joining. When we join two DataFrames, we are combining them *horizontally* – that is, we are adding new columns to an existing DataFrame. Concatenation, on the other hand, is _generally_ a *vertical* operation – we are adding new rows to an existing DataFrame.\n",
"\n",
"```{image} https://pandas.pydata.org/docs/_images/08_concat_row.svg\n",
":width: 70%\n",
":align: center\n",
"```\n",
"```{image} https://pandas.pydata.org/docs/_images/merging_concat_basic.png\n",
":width: 50%\n",
":align: center\n",
"```\n",
"\n",
"`pd.concat` is the `pandas` method used to concatenate DataFrames together. It takes as input a list of DataFrames to be concatenated and returns a new DataFrame containing all of the rows from each input DataFrame in the input list."
]
},
{
"cell_type": "markdown",
"id": "30381c58",
"metadata": {},
"source": [
"Let's say we wanted to concatenate data from two different years in `babynames`. We can do so using the [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) method.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "651c0361",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" sex | \n",
" count | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Mary | \n",
" F | \n",
" 7065 | \n",
" 1880 | \n",
"
\n",
" \n",
" 1 | \n",
" Anna | \n",
" F | \n",
" 2604 | \n",
" 1880 | \n",
"
\n",
" \n",
" 2 | \n",
" Emma | \n",
" F | \n",
" 2003 | \n",
" 1880 | \n",
"
\n",
" \n",
" 3 | \n",
" Elizabeth | \n",
" F | \n",
" 1939 | \n",
" 1880 | \n",
"
\n",
" \n",
" 4 | \n",
" Minnie | \n",
" F | \n",
" 1746 | \n",
" 1880 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 31910 | \n",
" Zuberi | \n",
" M | \n",
" 5 | \n",
" 2022 | \n",
"
\n",
" \n",
" 31911 | \n",
" Zydn | \n",
" M | \n",
" 5 | \n",
" 2022 | \n",
"
\n",
" \n",
" 31912 | \n",
" Zylon | \n",
" M | \n",
" 5 | \n",
" 2022 | \n",
"
\n",
" \n",
" 31913 | \n",
" Zymeer | \n",
" M | \n",
" 5 | \n",
" 2022 | \n",
"
\n",
" \n",
" 31914 | \n",
" Zymeire | \n",
" M | \n",
" 5 | \n",
" 2022 | \n",
"
\n",
" \n",
"
\n",
"
2085158 rows × 4 columns
\n",
"
"
],
"text/plain": [
" name sex count year\n",
"0 Mary F 7065 1880\n",
"1 Anna F 2604 1880\n",
"2 Emma F 2003 1880\n",
"3 Elizabeth F 1939 1880\n",
"4 Minnie F 1746 1880\n",
"... ... .. ... ...\n",
"31910 Zuberi M 5 2022\n",
"31911 Zydn M 5 2022\n",
"31912 Zylon M 5 2022\n",
"31913 Zymeer M 5 2022\n",
"31914 Zymeire M 5 2022\n",
"\n",
"[2085158 rows x 4 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd \n",
"url_template = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/names/yob%s.txt\"\n",
"\n",
"data_list = []\n",
"for year in range(1880, 2023):\n",
" url = url_template % year\n",
" data = pd.read_csv(url, header=None, names=['name', 'sex', 'count'])\n",
" data['year'] = year\n",
" data_list.append(data)\n",
"all_data = pd.concat(data_list)\n",
"all_data"
]
},
{
"cell_type": "markdown",
"id": "b0270138",
"metadata": {},
"source": [
"## Merging DataFrames\n",
"\n",
"When working on data science projects, we're unlikely to have absolutely all the data we want contained in a single DataFrame – a real-world data scientist needs to grapple with data coming from multiple sources. If we have access to multiple datasets with related information, we can merge two or more tables into a single DataFrame. \n",
"\n",
"To put this into practice, we'll revisit the `elections` dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08a04240",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" 113142 | \n",
" win | \n",
" 42.789878 | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" 500897 | \n",
" loss | \n",
" 43.796073 | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 702735 | \n",
" win | \n",
" 54.574789 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"0 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"1 1824 John Quincy Adams Democratic-Republican 113142 win \n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 John Quincy Adams National Republican 500897 loss \n",
"4 1832 Andrew Jackson Democratic 702735 win \n",
"\n",
" % \n",
"0 57.210122 \n",
"1 42.789878 \n",
"2 56.203927 \n",
"3 43.796073 \n",
"4 54.574789 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"elections.head(5)"
]
},
{
"cell_type": "markdown",
"id": "052f5eb3",
"metadata": {},
"source": [
"Say we want to understand the popularity of the names of each presidential candidate in 2020. To do this, we'll need the combined data of `names` *and* `elections`. \n",
"\n",
"```{figure} https://i.ibb.co/sKWBPM0/merge.png\n",
"---\n",
"width: 100%\n",
"name: merge\n",
"align: center\n",
"---\n",
"Merging two DataFrames\n",
"```\n",
"\n",
"\n",
"\n",
"\n",
"We'll start by creating a new column containing the first name of each presidential candidate. This will help us join each name in `elections` to the corresponding name data in `names`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b71b66be",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
" First Name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
" Andrew | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" John Quincy Adams | \n",
" Democratic-Republican | \n",
" 113142 | \n",
" win | \n",
" 42.789878 | \n",
" John | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
" Andrew | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" John Quincy Adams | \n",
" National Republican | \n",
" 500897 | \n",
" loss | \n",
" 43.796073 | \n",
" John | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 702735 | \n",
" win | \n",
" 54.574789 | \n",
" Andrew | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year Candidate Party Popular vote Result \\\n",
"0 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"1 1824 John Quincy Adams Democratic-Republican 113142 win \n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 John Quincy Adams National Republican 500897 loss \n",
"4 1832 Andrew Jackson Democratic 702735 win \n",
"\n",
" % First Name \n",
"0 57.210122 Andrew \n",
"1 42.789878 John \n",
"2 56.203927 Andrew \n",
"3 43.796073 John \n",
"4 54.574789 Andrew "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# This `str` operation splits each candidate's full name at each \n",
"# blank space, then takes just the candidiate's first name\n",
"elections[\"First Name\"] = elections[\"Candidate\"].str.split().str[0]\n",
"elections.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "202f3b7e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Sex | \n",
" Count | \n",
" Year | \n",
" First Letter | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Olivia | \n",
" F | \n",
" 17641 | \n",
" 2020 | \n",
" O | \n",
"
\n",
" \n",
" 1 | \n",
" Emma | \n",
" F | \n",
" 15656 | \n",
" 2020 | \n",
" E | \n",
"
\n",
" \n",
" 2 | \n",
" Ava | \n",
" F | \n",
" 13160 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
" 3 | \n",
" Charlotte | \n",
" F | \n",
" 13065 | \n",
" 2020 | \n",
" C | \n",
"
\n",
" \n",
" 4 | \n",
" Sophia | \n",
" F | \n",
" 13036 | \n",
" 2020 | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Sex Count Year First Letter\n",
"0 Olivia F 17641 2020 O\n",
"1 Emma F 15656 2020 E\n",
"2 Ava F 13160 2020 A\n",
"3 Charlotte F 13065 2020 C\n",
"4 Sophia F 13036 2020 S"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Here, we'll only consider `names` data from 2020\n",
"names_2020 = names[names[\"Year\"]==2020]\n",
"names_2020.head()"
]
},
{
"cell_type": "markdown",
"id": "05196734",
"metadata": {},
"source": [
"Now, we're ready to merge the two tables. [`pd.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) is the `pandas` method used to merge DataFrames together."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e956b2c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Year_x | \n",
" Candidate | \n",
" Party | \n",
" Popular vote | \n",
" Result | \n",
" % | \n",
" First Name | \n",
" Name | \n",
" Sex | \n",
" Count | \n",
" Year_y | \n",
" First Letter | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
" Andrew | \n",
" Andrew | \n",
" F | \n",
" 12 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
" 1 | \n",
" 1824 | \n",
" Andrew Jackson | \n",
" Democratic-Republican | \n",
" 151271 | \n",
" loss | \n",
" 57.210122 | \n",
" Andrew | \n",
" Andrew | \n",
" M | \n",
" 6036 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
" 2 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
" Andrew | \n",
" Andrew | \n",
" F | \n",
" 12 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
" 3 | \n",
" 1828 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 642806 | \n",
" win | \n",
" 56.203927 | \n",
" Andrew | \n",
" Andrew | \n",
" M | \n",
" 6036 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
" 4 | \n",
" 1832 | \n",
" Andrew Jackson | \n",
" Democratic | \n",
" 702735 | \n",
" win | \n",
" 54.574789 | \n",
" Andrew | \n",
" Andrew | \n",
" F | \n",
" 12 | \n",
" 2020 | \n",
" A | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Year_x Candidate Party Popular vote Result \\\n",
"0 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"1 1824 Andrew Jackson Democratic-Republican 151271 loss \n",
"2 1828 Andrew Jackson Democratic 642806 win \n",
"3 1828 Andrew Jackson Democratic 642806 win \n",
"4 1832 Andrew Jackson Democratic 702735 win \n",
"\n",
" % First Name Name Sex Count Year_y First Letter \n",
"0 57.210122 Andrew Andrew F 12 2020 A \n",
"1 57.210122 Andrew Andrew M 6036 2020 A \n",
"2 56.203927 Andrew Andrew F 12 2020 A \n",
"3 56.203927 Andrew Andrew M 6036 2020 A \n",
"4 54.574789 Andrew Andrew F 12 2020 A "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"merged = pd.merge(left = elections, right = names_2020, \\\n",
" left_on = \"First Name\", right_on = \"Name\")\n",
"merged.head()\n",
"# Notice that pandas automatically specifies `Year_x` and `Year_y` \n",
"# when both merged DataFrames have the same column name to avoid confusion"
]
},
{
"cell_type": "markdown",
"id": "b3c94890",
"metadata": {},
"source": [
"Let's take a closer look at the parameters:\n",
"\n",
"* `left` and `right` parameters are used to specify the DataFrames to be merge.\n",
"* `left_on` and `right_on` parameters are assigned to the string names of the columns to be used when performing the merge. These two `on` parameters tell `pandas` what values should act as pairing keys to determine which rows to merge across the DataFrames. We'll talk more about this idea of a pairing key next lecture.\n",
"\n",
"\n",
"```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge1.png\n",
"---\n",
"width: 80%\n",
"align: center\n",
"``` \n",
"\n",
"---\n",
"\n",
"```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge2.png\n",
"---\n",
"width: 80%\n",
"align: center\n",
"``` \n",
"\n",
"---\n",
"\n",
"```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge3.png\n",
"---\n",
"width: 80%\n",
"align: center\n",
"``` "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}