Data Manipulation and Wrangling

Let’s read in the same elections data from the previous exercise and do some data manipulation and wrangling using pandas.

import pandas as pd 
url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
elections = pd.read_csv(url)

Data Alignment

pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example:

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

s1, s2, s1 + s2
(a    7.3
 c   -2.5
 d    3.4
 e    1.5
 dtype: float64,
 a   -2.1
 c    3.6
 e   -1.5
 f    4.0
 g    3.1
 dtype: float64,
 a    5.2
 c    1.1
 d    NaN
 e    0.0
 f    NaN
 g    NaN
 dtype: float64)

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both rows and columns:

df1 = pd.DataFrame({"A": [1, 2], "B":[3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "D":[7, 8]})
df1 + df2


Math Operations

In native Python, we have a number of operators that we can use to manipulate data. Most, if not all, of these operators can be used with Pandas Series and DataFrames and are applied element-wise in parallel. A summary of the operators supported by Pandas is shown below:


Category Operators Supported by Pandas Comments
Arithmetic +, -, *, /, %, //, ** Assuming comparable shapes (equal length)
Assignment =, +=, -=, *=, /=, %=, //=, **= Assuming comparable shapes
Comparison ==, !=, >, <, >=, <= Assuming comparable shapes
Logical and, or, not Use &, \|, ~ instead
Identity is, is not Assuming comparable data type/structure
Membership in, not in Use isin() method instead
Bitwise &, \|, ^, ~, <<, >>


The most significant difference is that logical operators and, or, and not are NOT used with Pandas Series and DataFrames. Instead, we use &, |, and ~ respectively.

Membership operators in and not in are also not used with Pandas Series and DataFrames. Instead, we use the isin() method.


Creating new columns

Creating new columns in a DataFrame is a common task when working with data. In this notebook, we will see how to create new columns in a DataFrame based on existing columns or other values.

New columns can be created by assigning a value to a new column name. For example, to create a new column named new_column with a constant value 10, we can use the following code:

elections['constant'] = 10

elections.head()
Year Candidate Party Popular vote Result % constant
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 10
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 10
2 1828 Andrew Jackson Democratic 642806 win 56.203927 10
3 1828 John Quincy Adams National Republican 500897 loss 43.796073 10
4 1832 Andrew Jackson Democratic 702735 win 54.574789 10


If we want to create a new column based on an existing column, we can refer to the existing column by its name, within the square brackets, on the right side of the assignment operator. For example, to create a new column named new_column with the values of the existing column column1, we can use the following code:


elections['total_voters'] = ((elections['Popular vote']* 100) / elections['%']).astype(int)

elections.head()
Year Candidate Party Popular vote Result % constant total_voters
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122 10 264413
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878 10 264412
2 1828 Andrew Jackson Democratic 642806 win 56.203927 10 1143702
3 1828 John Quincy Adams National Republican 500897 loss 43.796073 10 1143703
4 1832 Andrew Jackson Democratic 702735 win 54.574789 10 1287655

Note that the new column will have the same length as the DataFrame and the calculations are element-wise. That is, the value of the new column at row i will be calculated based on the value of the existing column at row i.

These element-wise operations are vectorized and are very efficient.

Renaming columns

The .rename() method is used to rename the columns or index labels of a DataFrame.

elections.rename(columns={"Candidate":"Name"})
Year Name Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 6 columns

Converting column to index

The .set_index() method is used to set the DataFrame index using existing columns.

elections.set_index("Candidate")
Year Party Popular vote Result %
Candidate
Andrew Jackson 1824 Democratic-Republican 151271 loss 57.210122
John Quincy Adams 1824 Democratic-Republican 113142 win 42.789878
Andrew Jackson 1828 Democratic 642806 win 56.203927
John Quincy Adams 1828 National Republican 500897 loss 43.796073
Andrew Jackson 1832 Democratic 702735 win 54.574789
... ... ... ... ... ...
Jill Stein 2016 Green 1457226 loss 1.073699
Joseph Biden 2020 Democratic 81268924 win 51.311515
Donald Trump 2020 Republican 74216154 loss 46.858542
Jo Jorgensen 2020 Libertarian 1865724 loss 1.177979
Howard Hawkins 2020 Green 405035 loss 0.255731

182 rows × 5 columns

Converting index to column

The .reset_index() method is used to reset the index of a DataFrame. By default, the original index is stored in a new column called index.

elections.reset_index()
index Year Candidate Party Popular vote Result %
0 0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ... ...
177 177 2016 Jill Stein Green 1457226 loss 1.073699
178 178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 179 2020 Donald Trump Republican 74216154 loss 46.858542
180 180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 181 2020 Howard Hawkins Green 405035 loss 0.255731

182 rows × 7 columns

Sorting Data

Ordering a DataFrame can be useful for isolating extreme values. For example, the first 5 entries of a row sorted in descending order (that is, from highest to lowest) are the largest 5 values. .sort_values allows us to order a DataFrame or Series by a specified column. We can choose to either receive the rows in ascending order (default) or descending order.

# Sort the "Count" column from highest to lowest
elections.sort_values(by = "%", ascending=False).head()
Year Candidate Party Popular vote Result %
114 1964 Lyndon Johnson Democratic 43127041 win 61.344703
91 1936 Franklin Roosevelt Democratic 27752648 win 60.978107
120 1972 Richard Nixon Republican 47168710 win 60.907806
79 1920 Warren Harding Republican 16144093 win 60.574501
133 1984 Ronald Reagan Republican 54455472 win 59.023326

We do not need to explicitly specify the column used for sorting when calling .value_counts() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.

# Sort the "Name" Series alphabetically
elections["Candidate"].sort_values(ascending=True).head()
75     Aaron S. Watkins
27      Abraham Lincoln
23      Abraham Lincoln
108     Adlai Stevenson
105     Adlai Stevenson
Name: Candidate, dtype: object