{ "cells": [ { "cell_type": "markdown", "id": "620343fd", "metadata": {}, "source": [ "# Pandas V: Concatenation and Merging\n" ] }, { "cell_type": "markdown", "id": "46424189", "metadata": {}, "source": [ "## Concatenating DataFrames\n", "\n", "Another way to combine DataFrames is to concatenate them. Concatenation is a bit different from joining. When we join two DataFrames, we are combining them *horizontally* – that is, we are adding new columns to an existing DataFrame. Concatenation, on the other hand, is _generally_ a *vertical* operation – we are adding new rows to an existing DataFrame.\n", "\n", "```{image} https://pandas.pydata.org/docs/_images/08_concat_row.svg\n", ":width: 70%\n", ":align: center\n", "```\n", "```{image} https://pandas.pydata.org/docs/_images/merging_concat_basic.png\n", ":width: 50%\n", ":align: center\n", "```\n", "\n", "`pd.concat` is the `pandas` method used to concatenate DataFrames together. It takes as input a list of DataFrames to be concatenated and returns a new DataFrame containing all of the rows from each input DataFrame in the input list." ] }, { "cell_type": "markdown", "id": "30381c58", "metadata": {}, "source": [ "Let's say we wanted to concatenate data from two different years in `babynames`. We can do so using the [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) method.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "651c0361", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namesexcountyear
0MaryF70651880
1AnnaF26041880
2EmmaF20031880
3ElizabethF19391880
4MinnieF17461880
...............
31910ZuberiM52022
31911ZydnM52022
31912ZylonM52022
31913ZymeerM52022
31914ZymeireM52022
\n", "

2085158 rows × 4 columns

\n", "
" ], "text/plain": [ " name sex count year\n", "0 Mary F 7065 1880\n", "1 Anna F 2604 1880\n", "2 Emma F 2003 1880\n", "3 Elizabeth F 1939 1880\n", "4 Minnie F 1746 1880\n", "... ... .. ... ...\n", "31910 Zuberi M 5 2022\n", "31911 Zydn M 5 2022\n", "31912 Zylon M 5 2022\n", "31913 Zymeer M 5 2022\n", "31914 Zymeire M 5 2022\n", "\n", "[2085158 rows x 4 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd \n", "url_template = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/names/yob%s.txt\"\n", "\n", "data_list = []\n", "for year in range(1880, 2023):\n", " url = url_template % year\n", " data = pd.read_csv(url, header=None, names=['name', 'sex', 'count'])\n", " data['year'] = year\n", " data_list.append(data)\n", "all_data = pd.concat(data_list)\n", "all_data" ] }, { "cell_type": "markdown", "id": "b0270138", "metadata": {}, "source": [ "## Merging DataFrames\n", "\n", "When working on data science projects, we're unlikely to have absolutely all the data we want contained in a single DataFrame – a real-world data scientist needs to grapple with data coming from multiple sources. If we have access to multiple datasets with related information, we can merge two or more tables into a single DataFrame. \n", "\n", "To put this into practice, we'll revisit the `elections` dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "08a04240", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%
01824Andrew JacksonDemocratic-Republican151271loss57.210122
11824John Quincy AdamsDemocratic-Republican113142win42.789878
21828Andrew JacksonDemocratic642806win56.203927
31828John Quincy AdamsNational Republican500897loss43.796073
41832Andrew JacksonDemocratic702735win54.574789
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "0 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "1 1824 John Quincy Adams Democratic-Republican 113142 win \n", "2 1828 Andrew Jackson Democratic 642806 win \n", "3 1828 John Quincy Adams National Republican 500897 loss \n", "4 1832 Andrew Jackson Democratic 702735 win \n", "\n", " % \n", "0 57.210122 \n", "1 42.789878 \n", "2 56.203927 \n", "3 43.796073 \n", "4 54.574789 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "elections.head(5)" ] }, { "cell_type": "markdown", "id": "052f5eb3", "metadata": {}, "source": [ "Say we want to understand the popularity of the names of each presidential candidate in 2020. To do this, we'll need the combined data of `names` *and* `elections`. \n", "\n", "```{figure} https://i.ibb.co/sKWBPM0/merge.png\n", "---\n", "width: 100%\n", "name: merge\n", "align: center\n", "---\n", "Merging two DataFrames\n", "```\n", "\n", "\n", "\n", "\n", "We'll start by creating a new column containing the first name of each presidential candidate. This will help us join each name in `elections` to the corresponding name data in `names`. " ] }, { "cell_type": "code", "execution_count": null, "id": "b71b66be", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearCandidatePartyPopular voteResult%First Name
01824Andrew JacksonDemocratic-Republican151271loss57.210122Andrew
11824John Quincy AdamsDemocratic-Republican113142win42.789878John
21828Andrew JacksonDemocratic642806win56.203927Andrew
31828John Quincy AdamsNational Republican500897loss43.796073John
41832Andrew JacksonDemocratic702735win54.574789Andrew
\n", "
" ], "text/plain": [ " Year Candidate Party Popular vote Result \\\n", "0 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "1 1824 John Quincy Adams Democratic-Republican 113142 win \n", "2 1828 Andrew Jackson Democratic 642806 win \n", "3 1828 John Quincy Adams National Republican 500897 loss \n", "4 1832 Andrew Jackson Democratic 702735 win \n", "\n", " % First Name \n", "0 57.210122 Andrew \n", "1 42.789878 John \n", "2 56.203927 Andrew \n", "3 43.796073 John \n", "4 54.574789 Andrew " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This `str` operation splits each candidate's full name at each \n", "# blank space, then takes just the candidiate's first name\n", "elections[\"First Name\"] = elections[\"Candidate\"].str.split().str[0]\n", "elections.head(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "202f3b7e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYearFirst Letter
0OliviaF176412020O
1EmmaF156562020E
2AvaF131602020A
3CharlotteF130652020C
4SophiaF130362020S
\n", "
" ], "text/plain": [ " Name Sex Count Year First Letter\n", "0 Olivia F 17641 2020 O\n", "1 Emma F 15656 2020 E\n", "2 Ava F 13160 2020 A\n", "3 Charlotte F 13065 2020 C\n", "4 Sophia F 13036 2020 S" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Here, we'll only consider `names` data from 2020\n", "names_2020 = names[names[\"Year\"]==2020]\n", "names_2020.head()" ] }, { "cell_type": "markdown", "id": "05196734", "metadata": {}, "source": [ "Now, we're ready to merge the two tables. [`pd.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) is the `pandas` method used to merge DataFrames together." ] }, { "cell_type": "code", "execution_count": null, "id": "6e956b2c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Year_xCandidatePartyPopular voteResult%First NameNameSexCountYear_yFirst Letter
01824Andrew JacksonDemocratic-Republican151271loss57.210122AndrewAndrewF122020A
11824Andrew JacksonDemocratic-Republican151271loss57.210122AndrewAndrewM60362020A
21828Andrew JacksonDemocratic642806win56.203927AndrewAndrewF122020A
31828Andrew JacksonDemocratic642806win56.203927AndrewAndrewM60362020A
41832Andrew JacksonDemocratic702735win54.574789AndrewAndrewF122020A
\n", "
" ], "text/plain": [ " Year_x Candidate Party Popular vote Result \\\n", "0 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "1 1824 Andrew Jackson Democratic-Republican 151271 loss \n", "2 1828 Andrew Jackson Democratic 642806 win \n", "3 1828 Andrew Jackson Democratic 642806 win \n", "4 1832 Andrew Jackson Democratic 702735 win \n", "\n", " % First Name Name Sex Count Year_y First Letter \n", "0 57.210122 Andrew Andrew F 12 2020 A \n", "1 57.210122 Andrew Andrew M 6036 2020 A \n", "2 56.203927 Andrew Andrew F 12 2020 A \n", "3 56.203927 Andrew Andrew M 6036 2020 A \n", "4 54.574789 Andrew Andrew F 12 2020 A " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "merged = pd.merge(left = elections, right = names_2020, \\\n", " left_on = \"First Name\", right_on = \"Name\")\n", "merged.head()\n", "# Notice that pandas automatically specifies `Year_x` and `Year_y` \n", "# when both merged DataFrames have the same column name to avoid confusion" ] }, { "cell_type": "markdown", "id": "b3c94890", "metadata": {}, "source": [ "Let's take a closer look at the parameters:\n", "\n", "* `left` and `right` parameters are used to specify the DataFrames to be merge.\n", "* `left_on` and `right_on` parameters are assigned to the string names of the columns to be used when performing the merge. These two `on` parameters tell `pandas` what values should act as pairing keys to determine which rows to merge across the DataFrames. We'll talk more about this idea of a pairing key next lecture.\n", "\n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge1.png\n", "---\n", "width: 80%\n", "align: center\n", "``` \n", "\n", "---\n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge2.png\n", "---\n", "width: 80%\n", "align: center\n", "``` \n", "\n", "---\n", "\n", "```{figure} https://raw.githubusercontent.com/fahadsultan/csc272/main/assets/merge3.png\n", "---\n", "width: 80%\n", "align: center\n", "``` " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }