import pandas as pd
= "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
url = pd.read_csv(url) elections
Data Manipulation and Wrangling
Let’s read in the same elections data from the previous exercise and do some data manipulation and wrangling using pandas.
Data Alignment
pandas
can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example:
= pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s1
= pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s2
+ s2 s1, s2, s1
(a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64,
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64,
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64)
The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
In the case of DataFrame, alignment is performed on both rows and columns:
= pd.DataFrame({"A": [1, 2], "B":[3, 4]})
df1 = pd.DataFrame({"B": [5, 6], "D":[7, 8]})
df2 + df2 df1
Math Operations
In native Python, we have a number of operators that we can use to manipulate data. Most, if not all, of these operators can be used with Pandas Series and DataFrames and are applied element-wise in parallel. A summary of the operators supported by Pandas is shown below:
Category | Operators | Supported by Pandas | Comments |
---|---|---|---|
Arithmetic | + , - , * , / , % , // , ** |
✅ | Assuming comparable shapes (equal length) |
Assignment | = , += , -= , *= , /= , %= , //= , **= |
✅ | Assuming comparable shapes |
Comparison | == , != , > , < , >= , <= |
✅ | Assuming comparable shapes |
Logical | and , or , not |
❌ | Use & , \| , ~ instead |
Identity | is , is not |
✅ | Assuming comparable data type/structure |
Membership | in , not in |
❌ | Use isin() method instead |
Bitwise | & , \| , ^ , ~ , << , >> |
❌ |
The most significant difference is that logical operators and
, or
, and not
are NOT used with Pandas Series and DataFrames. Instead, we use &
, |
, and ~
respectively.
Membership operators in
and not in
are also not used with Pandas Series and DataFrames. Instead, we use the isin()
method.
Creating new columns
Creating new columns in a DataFrame is a common task when working with data. In this notebook, we will see how to create new columns in a DataFrame based on existing columns or other values.
New columns can be created by assigning a value to a new column name. For example, to create a new column named new_column
with a constant value 10
, we can use the following code:
'constant'] = 10
elections[
elections.head()
Year | Candidate | Party | Popular vote | Result | % | constant | |
---|---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 10 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 10 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 10 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 10 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 10 |
If we want to create a new column based on an existing column, we can refer to the existing column by its name, within the square brackets, on the right side of the assignment operator. For example, to create a new column named new_column
with the values of the existing column column1
, we can use the following code:
'total_voters'] = ((elections['Popular vote']* 100) / elections['%']).astype(int)
elections[
elections.head()
Year | Candidate | Party | Popular vote | Result | % | constant | total_voters | |
---|---|---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 | 10 | 264413 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 | 10 | 264412 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 | 10 | 1143702 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 | 10 | 1143703 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 | 10 | 1287655 |
Note that the new column will have the same length as the DataFrame and the calculations are element-wise. That is, the value of the new column at row i
will be calculated based on the value of the existing column at row i
.
These element-wise operations are vectorized and are very efficient.
Renaming columns
The .rename()
method is used to rename the columns or index labels of a DataFrame.
={"Candidate":"Name"}) elections.rename(columns
Year | Name | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... |
177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |
181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 |
182 rows × 6 columns
Converting column to index
The .set_index()
method is used to set the DataFrame index using existing columns.
"Candidate") elections.set_index(
Year | Party | Popular vote | Result | % | |
---|---|---|---|---|---|
Candidate | |||||
Andrew Jackson | 1824 | Democratic-Republican | 151271 | loss | 57.210122 |
John Quincy Adams | 1824 | Democratic-Republican | 113142 | win | 42.789878 |
Andrew Jackson | 1828 | Democratic | 642806 | win | 56.203927 |
John Quincy Adams | 1828 | National Republican | 500897 | loss | 43.796073 |
Andrew Jackson | 1832 | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... |
Jill Stein | 2016 | Green | 1457226 | loss | 1.073699 |
Joseph Biden | 2020 | Democratic | 81268924 | win | 51.311515 |
Donald Trump | 2020 | Republican | 74216154 | loss | 46.858542 |
Jo Jorgensen | 2020 | Libertarian | 1865724 | loss | 1.177979 |
Howard Hawkins | 2020 | Green | 405035 | loss | 0.255731 |
182 rows × 5 columns
Converting index to column
The .reset_index()
method is used to reset the index of a DataFrame. By default, the original index is stored in a new column called index
.
elections.reset_index()
index | Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|---|
0 | 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... | ... |
177 | 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
178 | 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
179 | 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
180 | 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |
181 | 181 | 2020 | Howard Hawkins | Green | 405035 | loss | 0.255731 |
182 rows × 7 columns
Sorting Data
Ordering a DataFrame
can be useful for isolating extreme values. For example, the first 5 entries of a row sorted in descending order (that is, from highest to lowest) are the largest 5 values. .sort_values
allows us to order a DataFrame
or Series
by a specified column. We can choose to either receive the rows in ascending
order (default) or descending
order.
# Sort the "Count" column from highest to lowest
= "%", ascending=False).head() elections.sort_values(by
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
114 | 1964 | Lyndon Johnson | Democratic | 43127041 | win | 61.344703 |
91 | 1936 | Franklin Roosevelt | Democratic | 27752648 | win | 60.978107 |
120 | 1972 | Richard Nixon | Republican | 47168710 | win | 60.907806 |
79 | 1920 | Warren Harding | Republican | 16144093 | win | 60.574501 |
133 | 1984 | Ronald Reagan | Republican | 54455472 | win | 59.023326 |
We do not need to explicitly specify the column used for sorting when calling .value_counts()
on a Series
. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.
# Sort the "Name" Series alphabetically
"Candidate"].sort_values(ascending=True).head() elections[
75 Aaron S. Watkins
27 Abraham Lincoln
23 Abraham Lincoln
108 Adlai Stevenson
105 Adlai Stevenson
Name: Candidate, dtype: object