Filtering: subset of rows

Extracting a subset of rows from a DataFrame is called filtering.

We can filter rows based on a boolean condition, similar to conditional statements (e.g., if, else) in Python.

For example, to filter rows of candidates who ran for elections since 2010, we can use the following code:

import pandas as pd 

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)

First k rows

The .head() method is used to view the first few rows of a DataFrame. By default, it returns the first 5 rows.

elections.head()
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789

Last k rows

The .tail() method is used to view the last few rows of a DataFrame. By default, it returns the last 5 rows.

elections.tail()
Year Candidate Party Popular vote Result %
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

Random k rows

The .sample() method is used to view a random sample of rows from a DataFrame. By default, it returns a single random row.

elections.sample()
Year Candidate Party Popular vote Result %
162 2008 Barack Obama Democratic 69498516 win 53.02351

.sample() method can also be used to return multiple random rows by specifying the number of rows to return using the n parameter.

elections.sample(7)
Year Candidate Party Popular vote Result %
111 1960 John Kennedy Democratic 34220984 win 50.082561
20 1856 James Buchanan Democratic 1835140 win 45.306080
67 1912 Eugene W. Chafin Prohibition 208156 loss 1.386325
50 1896 John M. Palmer National Democratic 134645 loss 0.969566
32 1872 Ulysses Grant Republican 3597439 win 55.928594
101 1948 Henry A. Wallace Progressive 1157328 loss 2.374144
14 1848 Lewis Cass Democratic 1223460 loss 42.552229

Filtering by position

The [] selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument: A slice of row numbers

Say we wanted the first four rows of our elections DataFrame.

elections[0:4]
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073

Filtering by condition

Step 1. A Filtering Condition

Perhaps the most interesting (and useful) method of selecting data from a Series is with a filtering condition.

First, we apply a boolean condition to the Series. This create a new Series of boolean values.

series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
series > 2
a    False
b    False
c     True
d     True
dtype: bool

We then use this boolean condition to index into our original Series. pandas will select only the entries in the original Series that satisfy the condition.

condition1 = series > 2
condition2 = elections['Year'] > 2010

Step 2. Applying the Condition

The second step is to apply the condition (pd.Series of boolean values, equal to the length of the DataFrame) to the DataFrame using the [] operator.

series[condition1]
c    3
d    4
dtype: int64
elections[condition2]
Year Candidate Party Popular vote Result %
168 2012 Barack Obama Democratic 65915795 win 51.258484
169 2012 Gary Johnson Libertarian 1275971 loss 0.992241
170 2012 Jill Stein Green 469627 loss 0.365199
171 2012 Mitt Romney Republican 60933504 loss 47.384076
172 2016 Darrell Castle Constitution 203091 loss 0.149640
173 2016 Donald Trump Republican 62984828 win 46.407862
174 2016 Evan McMullin Independent 732273 loss 0.539546
175 2016 Gary Johnson Libertarian 4489235 loss 3.307714
176 2016 Hillary Clinton Democratic 65853514 loss 48.521539
177 2016 Jill Stein Green 1457226 loss 1.073699
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
180 2020 Jo Jorgensen Libertarian 1865724 loss 1.177979
181 2020 Howard Hawkins Green 405035 loss 0.255731

To filter rows based on multiple conditions, we can use the & operator for AND and the | operator for OR.

For example, to filter rows of candidates who won the elections with less than 50% of the votes, we can use the following code:

condition = (elections['Result'] == 'win') & (elections['%'] < 50)

elections[condition]
Year Candidate Party Popular vote Result %
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
16 1848 Zachary Taylor Whig 1360235 win 47.309296
20 1856 James Buchanan Democratic 1835140 win 45.306080
23 1860 Abraham Lincoln Republican 1855993 win 39.699408
33 1876 Rutherford Hayes Republican 4034142 win 48.471624
36 1880 James Garfield Republican 4453337 win 48.369234
39 1884 Grover Cleveland Democratic 4914482 win 48.884933
43 1888 Benjamin Harrison Republican 5443633 win 47.858041
47 1892 Grover Cleveland Democratic 5553898 win 46.121393
70 1912 Woodrow Wilson Democratic 6296284 win 41.933422
74 1916 Woodrow Wilson Democratic 9126868 win 49.367987
100 1948 Harry Truman Democratic 24179347 win 49.601536
117 1968 Richard Nixon Republican 31783783 win 43.565246
140 1992 Bill Clinton Democratic 44909806 win 43.118485
144 1996 Bill Clinton Democratic 47400125 win 49.296938
152 2000 George W. Bush Republican 50456002 win 47.974666
173 2016 Donald Trump Republican 62984828 win 46.407862

.drop_duplicates()

If we have a DataFrame with many repeated rows, then .drop_duplicates() can be used to remove the repeated rows.

Where .unique() only works with individual columns (Series) and returns an array of unique values, .drop_duplicates() can be used with multiple columns (DataFrame) and returns a DataFrame with the repeated rows removed.

elections[['Party']].drop_duplicates()
Party
0 Democratic-Republican
2 Democratic
3 National Republican
6 Anti-Masonic
7 Whig
15 Free Soil
21 Republican
22 American
24 Constitutional Union
25 Southern Democratic
26 Northern Democratic
27 National Union
31 Liberal Republican
35 Greenback
38 Anti-Monopoly
41 Prohibition
42 Union Labor
48 Populist
50 National Democratic
58 Socialist
68 Progressive
78 Farmer–Labor
89 Communist
93 Union
103 Dixiecrat
110 States' Rights
115 American Independent
121 Independent
125 Libertarian
127 Citizens
136 New Alliance
147 Taxpayers
148 Natural Law
149 Green
150 Reform
160 Constitution