Reading Data

To begin our studies in pandas, we must first import the library into our Python environment with the statement import pandas as pd, where pd is the conventional alias for pandas. The import gives our code access to pandas data structures and methods.





CSV files can be read into pandas using read_csv. The following code cell imports pandas as pd, the conventional alias for pandas, and then reads the elections.csv file.

# `pd` is the conventional alias for Pandas
import pandas as pd

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
elections = pd.read_csv(url)
elections
     Year          Candidate                  Party  Popular vote  Result          %
0    1824     Andrew Jackson  Democratic-Republican        151271    loss  57.210122
1    1824  John Quincy Adams  Democratic-Republican        113142     win  42.789878
2    1828     Andrew Jackson             Democratic        642806     win  56.203927
3    1828  John Quincy Adams    National Republican        500897    loss  43.796073
4    1832     Andrew Jackson             Democratic        702735     win  54.574789
...   ...                ...                    ...           ...     ...        ...
177  2016         Jill Stein                  Green       1457226    loss   1.073699
178  2020       Joseph Biden             Democratic      81268924     win  51.311515
179  2020       Donald Trump             Republican      74216154    loss  46.858542
180  2020       Jo Jorgensen            Libertarian       1865724    loss   1.177979
181  2020     Howard Hawkins                  Green        405035    loss   0.255731

182 rows × 6 columns

Let’s dissect the code above.

  1. We first import the pandas library into our Python environment, using the alias pd.  import pandas as pd

  2. There are a number of ways to read data into a DataFrame. In this course, our datasets are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing the data path as an argument to the following pandas function.  pd.read_csv("data/elections.csv")

This code stores our DataFrame object in the elections variable. We see that our elections DataFrame has 182 rows and 6 columns (Year, Candidate, Party, Popular vote, Result, %). Each row represents a single record: in our example, a presidential candidate from some particular year. Each column represents a single attribute, or feature, of the record.

In the example above, we constructed a DataFrame object using data from a CSV file. As we’ll explore in the next section, we can also create a DataFrame with data of our own.

In the elections dataset, each row represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column represents one characteristic piece of information about each presidential candidate. For example, the column named Result stores whether or not the candidate won the election.
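
As a minimal sketch of the row and column structure described above, the following cell rebuilds the first three rows of the output shown earlier from an in-memory string (so that it runs without downloading the file) and inspects the shape and columns attributes. The names sample_csv and sample are ours, chosen only for illustration.

```python
import io
import pandas as pd

# First three rows of the elections output shown above, typed in by hand
sample_csv = (
    "Year,Candidate,Party,Popular vote,Result,%\n"
    "1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122\n"
    "1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878\n"
    "1828,Andrew Jackson,Democratic,642806,win,56.203927\n"
)
sample = pd.read_csv(io.StringIO(sample_csv))

# shape is a (rows, columns) tuple: records first, attributes second
n_rows, n_cols = sample.shape
column_names = list(sample.columns)

print(n_rows, n_cols)
print(column_names)
```

Applied to the full elections DataFrame, the same two attributes should report the 182 rows and 6 columns listed under the output above.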

Some relevant arguments for read_csv are:



Arguments

Separator

The sep argument specifies the character used to separate the values in the CSV file. By default, this is a comma ,. However, some CSV files use other characters, such as tabs or semicolons, to separate values. In such cases, we can specify the separator character using the sep argument.
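
A minimal sketch of the sep argument, assuming a small semicolon-separated sample held in memory rather than a real file (the data and the names semicolon_csv and semi are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data; io.StringIO stands in for a file
semicolon_csv = (
    "Year;Candidate;Result\n"
    "1824;Andrew Jackson;loss\n"
    "1824;John Quincy Adams;win\n"
)

# Tell the parser that fields are separated by ';' instead of ','
semi = pd.read_csv(io.StringIO(semicolon_csv), sep=";")
print(semi)
```

For a tab-separated file, the same call would use sep="\t".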


Header row (column labels)

The header argument specifies the row number to use as the column names. By default, pandas infers this, which (when no names argument is given) is equivalent to header=0: the first row is used as the column names. If the CSV file does not have a header row, we can set header=None, in which case the columns are labeled with integers starting from 0.
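
The sketch below illustrates reading data that has no header row. It assumes a small in-memory sample (made up for illustration) and also uses the names argument of read_csv to supply our own column labels.

```python
import io
import pandas as pd

# Hypothetical data with no header line: the first row is already a record
headerless_csv = (
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# header=None keeps the first row as data; columns are labeled 0, 1, 2
no_header = pd.read_csv(io.StringIO(headerless_csv), header=None)

# names= supplies explicit column labels instead of the integer defaults
named = pd.read_csv(
    io.StringIO(headerless_csv),
    header=None,
    names=["Year", "Candidate", "Result"],
)

print(list(no_header.columns))
print(list(named.columns))
```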


Index column (row labels)

The index_col argument specifies the column to use as the row labels of the DataFrame. By default, this is None, which means that the row labels are integers starting from 0. If we want to use one of the columns as the row labels, we can specify the column name or index using the index_col argument.
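
As a brief sketch, index_col can be given either a column name or a column position. The sample data and variable names below are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample; io.StringIO stands in for a file on disk
index_csv = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Use the Candidate column as row labels, selected by name...
by_name = pd.read_csv(io.StringIO(index_csv), index_col="Candidate")

# ...or by position (Candidate is the second column, so position 1)
by_position = pd.read_csv(io.StringIO(index_csv), index_col=1)

# Rows can now be looked up by candidate name
print(by_name.loc["Andrew Jackson", "Result"])
```

Note that the column chosen as the index no longer appears among the regular columns of the DataFrame.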


Ignore erroneous lines

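The on_bad_lines argument specifies what the parser should do when it encounters a line with too many fields. By default, this is "error", which means that the parser raises an error for such a line. If we want the parser to skip such lines, we can set on_bad_lines="skip"; on_bad_lines="warn" skips them as well but also emits a warning. This argument was introduced in pandas 1.3. The older error_bad_lines argument, which played the same role, was deprecated in pandas 1.3 and removed in pandas 2.0.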
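
In pandas 1.3 and later, the handling of malformed lines is controlled by the on_bad_lines argument. The following sketch assumes such a version and a made-up sample in which one line has an extra field.

```python
import io
import pandas as pd

# Hypothetical sample: the second data line has four fields instead of three
bad_csv = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win,extra\n"
    "1828,Andrew Jackson,win\n"
)

# With the default setting (on_bad_lines="error") the parser raises
try:
    pd.read_csv(io.StringIO(bad_csv))
except pd.errors.ParserError as err:
    print("default behaviour raised:", err)

# on_bad_lines="skip" silently drops the offending line
cleaned = pd.read_csv(io.StringIO(bad_csv), on_bad_lines="skip")
print(cleaned)
```

Skipping lines silently can hide data problems, so it is usually worth checking how many rows were dropped.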


Skip first k rows

The skiprows argument specifies the number of rows to skip at the beginning of the CSV file. By default, this is None, which means that no rows are skipped. If we want to skip a certain number of rows at the beginning of the file, we can specify the number of rows using the skiprows argument.
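
A minimal sketch of skiprows, assuming a made-up file whose first two lines are free-text notes that precede the real header:

```python
import io
import pandas as pd

# Hypothetical export with two lines of notes before the header row
notes_csv = (
    "Source: hypothetical export\n"
    "Generated for illustration\n"
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Skip the first two lines so that the third line becomes the header
skipped = pd.read_csv(io.StringIO(notes_csv), skiprows=2)
print(skipped)
```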


Read only k rows

The nrows argument specifies the number of rows to read from the CSV file. By default, this is None, which means that all rows are read. If we want to read only a certain number of rows from the file, we can specify the number of rows using the nrows argument.
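
The sketch below reads only the first two data rows of a small in-memory sample (the sample and variable names are ours, for illustration). The header line does not count toward nrows.

```python
import io
import pandas as pd

# Hypothetical sample with four data rows
four_rows_csv = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
    "1828,Andrew Jackson,win\n"
    "1828,John Quincy Adams,loss\n"
)

# Read the header plus only the first two data rows
first_two = pd.read_csv(io.StringIO(four_rows_csv), nrows=2)
print(first_two)
```

This is handy for previewing a large file before committing to loading all of it.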


Read only subset of columns

The usecols argument specifies the columns to read from the CSV file. By default, this is None, which means that all columns are read. If we want to read only certain columns from the file, we can specify the column names or indices using the usecols argument.
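
As a short sketch, usecols accepts either a list of column names or a list of column positions. The sample data below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample with three columns
cols_csv = (
    "Year,Candidate,Result\n"
    "1824,Andrew Jackson,loss\n"
    "1824,John Quincy Adams,win\n"
)

# Keep only two of the three columns, selected by name...
subset_by_name = pd.read_csv(io.StringIO(cols_csv), usecols=["Year", "Result"])

# ...or by zero-based position
subset_by_position = pd.read_csv(io.StringIO(cols_csv), usecols=[0, 2])

print(list(subset_by_name.columns))
```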


Character Encoding

The encoding argument specifies the character encoding to use when reading the CSV file. By default, this is None, which means that the file is decoded as UTF-8; pandas does not detect the encoding automatically. If the CSV file uses a different character encoding, we can specify it using the encoding argument. Some common character encodings are:

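  • utf-8: Unicode Transformation Format, 8-bit. A variable-length encoding that can represent every Unicode character and is backward compatible with ASCII.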
  • latin1 or iso-8859-1: ISO 8859-1 supports many languages, including English, French, German, Spanish, and Portuguese.
  • cp1252: The cp1252 encoding is similar to the latin1 encoding, but it includes additional characters that are not present in latin1. The cp1252 encoding is commonly used in Windows operating systems.
  • ascii: The ASCII encoding is a 7-bit encoding that supports English characters and some special characters, such as punctuation marks and symbols.
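
To illustrate the encoding argument, the sketch below encodes a small made-up sample as latin1 bytes and reads it back by naming that encoding explicitly. The candidate name and the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample containing a non-ASCII character (é)
accent_text = "Candidate,Party\nJosé Example,Hypothetical\n"

# Simulate a file that was saved with the latin1 encoding
latin1_bytes = accent_text.encode("latin1")

# Tell read_csv how the bytes are encoded so that é is decoded correctly
decoded = pd.read_csv(io.BytesIO(latin1_bytes), encoding="latin1")
print(decoded)
```

If the encoding of a file is unknown, trying utf-8 first and then cp1252 or latin1 is a common starting point, though it is not guaranteed to recover the intended characters.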