To begin our studies in pandas, we must first import the library into our Python environment using the statement import pandas as pd, where pd is the conventional alias for pandas. The import statement allows us to use pandas data structures and methods in our code.
CSV files can be read into pandas using read_csv. The following code cell imports pandas as pd and then reads the elections.csv file.
# `pd` is the conventional alias for pandas
import pandas as pd

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
elections = pd.read_csv(url)
elections
     Year          Candidate                  Party  Popular vote  Result          %
  0  1824     Andrew Jackson  Democratic-Republican        151271    loss  57.210122
  1  1824  John Quincy Adams  Democratic-Republican        113142     win  42.789878
  2  1828     Andrew Jackson             Democratic        642806     win  56.203927
  3  1828  John Quincy Adams    National Republican        500897    loss  43.796073
  4  1832     Andrew Jackson             Democratic        702735     win  54.574789
...   ...                ...                    ...           ...     ...        ...
177  2016         Jill Stein                  Green       1457226    loss   1.073699
178  2020       Joseph Biden             Democratic      81268924     win  51.311515
179  2020       Donald Trump             Republican      74216154    loss  46.858542
180  2020       Jo Jorgensen            Libertarian       1865724    loss   1.177979
181  2020     Howard Hawkins                  Green        405035    loss   0.255731

182 rows × 6 columns
Let’s dissect the code above.
We first import the pandas library into our Python environment, using the alias pd:
import pandas as pd
There are a number of ways to read data into a DataFrame. In this course, our datasets are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing the data path as an argument to the following pandas function:
pd.read_csv("data/elections.csv")
This code stores our DataFrame object in the elections variable. We see that our elections DataFrame has 182 rows and 6 columns (Year, Candidate, Party, Popular vote, Result, %). Each row represents a single record: in our example, a presidential candidate from some particular year. Each column represents a single attribute, or feature, of the record.
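As a quick check of the row and column structure described here, the following is a minimal sketch that reads the first three rows of the table above from an in-memory string (so it runs without network access) and inspects the result. The variable names csv_text and sample are illustrative, not part of the course code.

```python
import io

import pandas as pd

# Three rows copied from the table above, held in memory for illustration.
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# io.StringIO makes the string behave like an open file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column labels, one per attribute
print(sample.iloc[0])        # the first record
```

The same attributes (shape, columns) and the iloc accessor work on the full elections DataFrame.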
In the example above, we constructed a DataFrame object using data from a CSV file. As we’ll explore in the next section, we can also create a DataFrame with data of our own.
In the elections dataset, each row represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column represents one characteristic piece of information about each presidential candidate. For example, the column named Result stores whether or not the candidate won the election.
Some relevant arguments for read_csv are:
filepath_or_buffer: The path or URL of the CSV file, or an open file-like object.
sep: The character that separates the values in the CSV file. By default, this is a comma ,.
header: The row number to use as the column names. By default, pandas infers this, which in practice means row 0 (the first row) is used as the column names.
index_col: The column to use as the row labels of the DataFrame. By default, this is None, which means that the row labels are integers starting from 0.
on_bad_lines: What to do with lines that have too many fields. By default, this is 'error', which raises an exception; 'skip' drops such lines instead. This argument replaces the older error_bad_lines, which was deprecated in pandas 1.3 and removed in pandas 2.0.
Arguments
Separator
The sep argument specifies the character used to separate the values in the CSV file. By default, this is a comma ,. However, some CSV files use other characters, such as tabs or semicolons, to separate values. In such cases, we can specify the separator character using the sep argument.
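To illustrate the sep argument, here is a small sketch that parses the same made-up excerpt once with semicolons and once with tabs. The in-memory strings stand in for files on disk, and all variable names are illustrative.

```python
import io

import pandas as pd

# The same small excerpt, once semicolon-separated and once tab-separated.
semicolon_text = (
    "Year;Candidate;Result\n"
    "1824;Andrew Jackson;loss\n"
    "1824;John Quincy Adams;win\n"
)
tab_text = semicolon_text.replace(";", "\t")

by_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
by_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# With the default sep=",", each line is read as one single value,
# so the result has only one column.
wrong = pd.read_csv(io.StringIO(semicolon_text))

print(by_semicolon.shape)
print(wrong.shape)
```

A DataFrame with a single, oddly named column is often the first sign that the separator was not the one pandas expected.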
Header row (column labels)
The header argument specifies the row number to use as the column names. By default, pandas infers this, which is equivalent to header=0 when no column names are passed explicitly: the first row is used as the column names. If the CSV file does not have a header row, we can set header=None, in which case pandas labels the columns with integers starting from 0.
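The sketch below shows header=None on a headerless excerpt, and additionally uses the names argument of read_csv (not discussed above) to supply column names by hand. The data and variable names are illustrative.

```python
import io

import pandas as pd

# An excerpt with no header row: the first line is already data.
text = "1824,Andrew Jackson,loss\n1824,John Quincy Adams,win\n"

# header=None keeps the first line as data and labels columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(text), header=None)

# names= supplies our own column labels for the headerless file.
named = pd.read_csv(
    io.StringIO(text),
    header=None,
    names=["Year", "Candidate", "Result"],
)

print(list(no_header.columns))
print(list(named.columns))
```

Without header=None, the first candidate's row would be consumed as column names and silently lost from the data.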
Index column (row labels)
The index_col argument specifies the column to use as the row labels of the DataFrame. By default, this is None, which means that the row labels are integers starting from 0. If we want to use one of the columns as the row labels, we can specify the column name or index using the index_col argument.
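As a minimal sketch of index_col, the example below selects the same column as row labels once by name and once by position. The two-row excerpt and the variable names are illustrative.

```python
import io

import pandas as pd

text = (
    "Year,Candidate,Result\n"
    "1824,John Quincy Adams,win\n"
    "1828,Andrew Jackson,win\n"
)

# Default: row labels are the integers 0, 1, ...
default_index = pd.read_csv(io.StringIO(text))

# Use the Candidate column as row labels, by name or by position.
by_name = pd.read_csv(io.StringIO(text), index_col="Candidate")
by_position = pd.read_csv(io.StringIO(text), index_col=1)

print(list(default_index.index))
print(by_name.loc["Andrew Jackson", "Year"])
```

Once a column becomes the index, it is no longer one of the regular columns, and rows can be looked up by label with loc.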
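Ignore erroneous lines
The on_bad_lines argument specifies what the parser should do with lines that have too many fields. By default, this is 'error', which means that the parser raises an error if it encounters such a line. If we want the parser to skip such lines, we can set on_bad_lines='skip'; 'warn' skips them and also emits a warning. Older versions of pandas used error_bad_lines for this purpose (with error_bad_lines=False skipping bad lines), but that argument was deprecated in pandas 1.3 and removed in pandas 2.0.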
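The sketch below assumes pandas 1.3 or later, where on_bad_lines is the argument that controls how lines with too many fields are handled. It builds a small excerpt with one deliberately malformed line and compares the default behaviour with skipping. The data and variable names are illustrative.

```python
import io

import pandas as pd

# The third line of the file has an extra fourth field.
text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win,extra\n"
    "1828,Andrew Jackson,win\n"
)

# Default behaviour (on_bad_lines="error"): the read fails.
raised = False
try:
    pd.read_csv(io.StringIO(text))
except pd.errors.ParserError as err:
    raised = True
    print("Parsing failed:", err)

# Assumes pandas >= 1.3: drop the malformed line and keep the rest.
skipped = pd.read_csv(io.StringIO(text), on_bad_lines="skip")
print(skipped)
```

Skipping silently discards data, so it is worth checking how many rows were dropped before relying on the result.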
Skip first k rows
The skiprows argument specifies the number of rows to skip at the beginning of the CSV file. By default, this is None, which means that no rows are skipped. If we want to skip a certain number of rows at the beginning of the file, we can specify the number of rows using the skiprows argument.
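A common use of skiprows is a file that starts with a few lines of notes before the real header. The following sketch uses a made-up excerpt with two such lines; the variable names are illustrative.

```python
import io

import pandas as pd

# Two lines of free-form notes precede the actual header row.
text = (
    "Source: example file\n"
    "Generated for illustration\n"
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Skip the first two lines; the third line then becomes the header.
df = pd.read_csv(io.StringIO(text), skiprows=2)

print(list(df.columns))
print(df.shape)
```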
Read only k rows
The nrows argument specifies the number of rows to read from the CSV file. By default, this is None, which means that all rows are read. If we want to read only a certain number of rows from the file, we can specify the number of rows using the nrows argument.
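Reading only the first few rows is a cheap way to preview a large file. Here is a minimal sketch with an illustrative four-row excerpt:

```python
import io

import pandas as pd

text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
    "1828,Andrew Jackson,win\n"
    "1828,John Quincy Adams,loss\n"
)

# Read the header plus only the first two data rows.
preview = pd.read_csv(io.StringIO(text), nrows=2)

print(preview.shape)
```

The header row does not count toward nrows; only data rows do.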
Read only subset of columns
The usecols argument specifies the columns to read from the CSV file. By default, this is None, which means that all columns are read. If we want to read only certain columns from the file, we can specify the column names or indices using the usecols argument.
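The sketch below selects the same two columns with usecols, once by name and once by position. The excerpt and variable names are illustrative.

```python
import io

import pandas as pd

text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Keep only Year and Result, selected by name...
by_name = pd.read_csv(io.StringIO(text), usecols=["Year", "Result"])

# ...or by zero-based column position.
by_position = pd.read_csv(io.StringIO(text), usecols=[0, 2])

print(list(by_name.columns))
```

Columns left out of usecols are never loaded, which can save memory on wide files.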
Character Encoding
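The encoding argument specifies the character encoding to use when reading the CSV file. By default, pandas reads the file as utf-8; it does not detect the encoding automatically. If the CSV file uses a different character encoding, reading it with the default typically raises a UnicodeDecodeError or produces garbled characters, and we can specify the correct encoding using the encoding argument. Some common character encodings are:
utf-8: The 8-bit Unicode Transformation Format, a variable-width encoding that can represent every Unicode character.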
latin1 or iso-8859-1: ISO 8859-1 supports many languages, including English, French, German, Spanish, and Portuguese.
cp1252: The cp1252 encoding is similar to the latin1 encoding, but it includes additional characters that are not present in latin1. The cp1252 encoding is commonly used in Windows operating systems.
ascii: The ASCII encoding is a 7-bit encoding that supports English characters and some special characters, such as punctuation marks and symbols.
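To see the encoding argument in action, the sketch below encodes a tiny made-up CSV as latin1 bytes, then reads it back with and without the matching encoding. The names in the data are invented for illustration and are not from the elections dataset.

```python
import io

import pandas as pd

# Made-up data with accented characters, stored as latin1 bytes.
raw = "Candidate,Party\nJosé Example,Démocrate\n".encode("latin1")

# Read as utf-8 (the default): the latin1 byte for "é" is not valid utf-8.
utf8_failed = False
try:
    pd.read_csv(io.BytesIO(raw))
except UnicodeDecodeError as err:
    utf8_failed = True
    print("Could not decode as utf-8:", err)

# Naming the correct encoding recovers the original text.
decoded = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(decoded)
```

When the encoding of a file is unknown, trying utf-8 first and then cp1252 or latin1 is a reasonable starting point, though the result should always be checked by eye.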