Reading Data

To begin working with pandas, we must first import the library into our Python environment using the statement import pandas as pd. Here pd is the conventional alias for pandas. Once imported, pandas data structures and functions are available in our code through the pd prefix.

CSV files can be read into pandas using read_csv. The following code cell imports pandas under the alias pd and then reads the elections.csv file from a URL.

# `pd` is the conventional alias for Pandas
import pandas as pd

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
elections = pd.read_csv(url)
elections

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122
1	1824	John Quincy Adams	Democratic-Republican	113142	win	42.789878
2	1828	Andrew Jackson	Democratic	642806	win	56.203927
3	1828	John Quincy Adams	National Republican	500897	loss	43.796073
4	1832	Andrew Jackson	Democratic	702735	win	54.574789
...	...	...	...	...	...	...
177	2016	Jill Stein	Green	1457226	loss	1.073699
178	2020	Joseph Biden	Democratic	81268924	win	51.311515
179	2020	Donald Trump	Republican	74216154	loss	46.858542
180	2020	Jo Jorgensen	Libertarian	1865724	loss	1.177979
181	2020	Howard Hawkins	Green	405035	loss	0.255731

182 rows × 6 columns

Let’s dissect the code above.

We first import the pandas library into our Python environment, using the alias pd:
import pandas as pd

There are a number of ways to read data into a DataFrame. In this course, our datasets are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing its path (or, as in the cell above, its URL) as an argument to the following pandas function:
pd.read_csv("data/elections.csv")

This code stores our DataFrame object in the elections variable. We see that our elections DataFrame has 182 rows and 6 columns (Year, Candidate, Party, Popular vote, Result, %). Each row represents a single record: in our example, a presidential candidate from a particular year. Each column represents a single attribute, or feature, of the record.

In the example above, we constructed a DataFrame object using data from a CSV file. As we’ll explore in the next section, we can also create a DataFrame with data of our own.

In the elections dataset, each row represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column represents one characteristic piece of information about each presidential candidate. For example, the column named Result stores whether or not the candidate won the election.
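The row-and-column structure described above can be checked directly on a DataFrame. The following is a minimal sketch that assumes a three-row excerpt of the elections data, copied from the output shown earlier and held in an in-memory string (csv_text is a name introduced here for illustration) so that it runs without a network connection.

```python
import io
import pandas as pd

# A small excerpt of elections.csv, typed in by hand for illustration
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# read_csv accepts a file-like object as well as a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Column labels come from the first line of the CSV text
print(list(sample.columns))

# Each row is one record: one candidate in one election year
first = sample.iloc[0]
print(first["Candidate"], first["Year"], first["Result"])
```

On the full dataset, elections.shape would report the 182 rows and 6 columns listed under the table.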

Some relevant arguments for read_csv are:

filepath_or_buffer: The path or URL of the CSV file, or an open file-like object.
sep: The character that separates the values in the CSV file. By default, this is a comma ,.
header: The row number to use as the column names. By default, the header is inferred, which behaves like header=0: the first row is used as the column names.
index_col: The column to use as the row labels of the DataFrame. By default, this is None, which means that the row labels are integers starting from 0.
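on_bad_lines: What to do with lines that have too many fields. By default, this is "error", which raises an exception; "warn" skips such lines with a warning and "skip" skips them silently. This argument replaces error_bad_lines, which was deprecated in pandas 1.3 and removed in pandas 2.0.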

Arguments

Separator

The sep argument specifies the character used to separate the values in the CSV file. By default, this is a comma ,. However, some CSV files use other characters, such as tabs or semicolons, to separate values. In such cases, we can specify the separator character using the sep argument.
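As a small illustration of sep, the sketch below reads the same hypothetical records twice, once separated by semicolons and once by tabs. The data is an in-memory string made up for this example, not a file from the course.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data
semicolon_text = (
    "Year;Candidate;Result\n"
    "1824;Andrew Jackson;loss\n"
    "1824;John Quincy Adams;win\n"
)
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# The same records separated by tabs
tab_text = semicolon_text.replace(";", "\t")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(df_semicolon)
print(df_semicolon.equals(df_tab))
```

If sep is left at its default for the semicolon-separated text, pandas finds no commas and reads each line as a single column, which is a common sign that the separator needs to be set.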

Header row (column labels)

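The header argument specifies the row number to use as the column names. By default, the header is inferred, which behaves like header=0: the first row is used as the column names. If the CSV file does not have a header row, we can set header=None, in which case the columns are labeled with the integers 0, 1, 2, and so on. Our own column names can then be supplied through the names argument.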
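The following sketch shows header=None on a hypothetical CSV string with no header row, first with the default integer labels and then with labels supplied through the names argument.

```python
import io
import pandas as pd

# Hypothetical data with no header row
no_header_text = (
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# header=None: no row is treated as column names, so columns are labeled 0, 1, 2
unnamed = pd.read_csv(io.StringIO(no_header_text), header=None)
print(list(unnamed.columns))

# names supplies our own column labels for the same data
named = pd.read_csv(
    io.StringIO(no_header_text),
    header=None,
    names=["Year", "Candidate", "Result"],
)
print(named)
```

Without header=None, the first record would be consumed as column names and the DataFrame would have one row fewer than the file.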

Index column (row labels)

The index_col argument specifies the column to use as the row labels of the DataFrame. By default, this is None, which means that the row labels are integers starting from 0. If we want to use one of the columns as the row labels, we can specify the column name or index using the index_col argument.
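A short sketch of index_col, assuming a two-row excerpt in which every candidate name is unique so that it can serve as a row label. The column can be given by name or by position.

```python
import io
import pandas as pd

# Two-row excerpt, typed in by hand for illustration
csv_text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Use the Candidate column as row labels, selected by name...
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Candidate")

# ...or by position (Candidate is the column at index 1)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=1)

print(by_name)

# Rows can now be looked up by candidate name
print(by_name.loc["Andrew Jackson", "Result"])
```

Note that the column chosen as the index no longer appears among the regular columns of the DataFrame.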

Ignore erroneous lines

The on_bad_lines argument specifies what the parser should do when it encounters a line with too many fields. By default, this is "error", which means that the parser raises an error. If we want the parser to skip such lines, we can set on_bad_lines="skip", or on_bad_lines="warn" to skip them while emitting a warning. Older code may use error_bad_lines=False for the same purpose; that argument was deprecated in pandas 1.3 and removed in pandas 2.0.
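The sketch below illustrates skipping a malformed line. It assumes pandas 1.3 or later, where this behavior is controlled by the on_bad_lines argument (the older error_bad_lines argument was removed in pandas 2.0). The data is a hypothetical string in which one line has an extra field.

```python
import io
import pandas as pd

# Hypothetical data: the second record has four fields instead of three
bad_text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win,extra\n"
    "1828,Andrew Jackson,win\n"
)

# With the default on_bad_lines="error", the extra field raises a ParserError
raised = False
try:
    pd.read_csv(io.StringIO(bad_text))
except pd.errors.ParserError:
    raised = True
print("default raised an error:", raised)

# on_bad_lines="skip" drops the malformed line and keeps the rest
cleaned = pd.read_csv(io.StringIO(bad_text), on_bad_lines="skip")
print(cleaned)
```

Skipping silently discards data, so it is worth checking how many lines were dropped before relying on the result.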

Skip first k rows

The skiprows argument specifies the number of rows to skip at the beginning of the CSV file. By default, this is None, which means that no rows are skipped. If we want to skip a certain number of rows at the beginning of the file, we can specify the number of rows using the skiprows argument.
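A typical use of skiprows is a file that starts with a few lines of notes before the actual header. The sketch below assumes a hypothetical export with two such lines.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of notes precede the header row
text_with_notes = (
    "Source: example export\n"
    "Generated for illustration only\n"
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Skip the first two lines so that the third line becomes the header
df_skipped = pd.read_csv(io.StringIO(text_with_notes), skiprows=2)
print(df_skipped)
```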

Read only k rows

The nrows argument specifies the number of rows to read from the CSV file. By default, this is None, which means that all rows are read. If we want to read only a certain number of rows from the file, we can specify the number of rows using the nrows argument.
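Reading only the first few rows is a quick way to preview a large file. The sketch below uses a four-row excerpt of the elections data, typed in by hand, and reads only the first two data rows.

```python
import io
import pandas as pd

# Four-row excerpt, typed in by hand for illustration
csv_text = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
    "1828,Andrew Jackson,win\n"
    "1828,John Quincy Adams,loss\n"
)

# nrows counts data rows; the header line is not included in the count
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)
```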

Read only subset of columns

The usecols argument specifies the columns to read from the CSV file. By default, this is None, which means that all columns are read. If we want to read only certain columns from the file, we can specify the column names or indices using the usecols argument.
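The sketch below selects two of four columns with usecols, once by name and once by position, using a small hand-typed excerpt of the elections data.

```python
import io
import pandas as pd

# Two-row excerpt with four columns, typed in by hand for illustration
csv_text = (
    "Year,Candidate,Party,Result\n"
    "1828,Andrew Jackson,Democratic,win\n"
    "1828,John Quincy Adams,National Republican,loss\n"
)

# Keep only two columns, selected by name
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Year", "Candidate"])

# The same selection by column position
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])

print(by_name)
print(by_name.equals(by_position))
```

For wide files, reading only the needed columns can noticeably reduce memory use and loading time.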

Character Encoding

The encoding argument specifies the character encoding to use when reading the CSV file. pandas does not detect the encoding automatically: if encoding is not specified, the file is decoded as UTF-8. If the CSV file uses a different character encoding, reading it may raise a UnicodeDecodeError or produce garbled characters, and we can specify the correct encoding using the encoding argument. Some common character encodings are:

utf-8: Unicode Transformation Format 8-bit, a variable-width encoding that can represent every Unicode character.
latin1 or iso-8859-1: ISO 8859-1 supports many languages, including English, French, German, Spanish, and Portuguese.
cp1252: The cp1252 encoding is similar to the latin1 encoding, but it includes additional characters that are not present in latin1. The cp1252 encoding is commonly used in Windows operating systems.
ascii: The ASCII encoding is a 7-bit encoding that supports English characters and some special characters, such as punctuation marks and symbols.
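To see the encoding argument at work, the sketch below builds a few bytes of latin1-encoded CSV data in memory and reads them back. The names and cities are made up for illustration and are not part of the elections dataset.

```python
import io
import pandas as pd

# Hypothetical data containing accented characters
text = "Name,City\nJosé,Málaga\nZoë,Köln\n"

# Simulate a file that was saved in the latin1 encoding
raw_bytes = text.encode("latin1")

# Tell read_csv how the bytes should be decoded
people = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(people)
```

When the encoding of a file is unknown, trying utf-8 first and then cp1252 or latin1 is a common heuristic, though only the source of the file can confirm which encoding is actually correct.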