import pandas as pd
= pd.read_csv('../data/elections.csv')
elections
def extract_first_name(name):
# name is of type string e.g. "Andrew Jackson"
= name.split(" ")
space_separated_substrings = space_separated_substrings[0]
first_name return first_name
= elections['Candidate']
candidate_column 'first_name'] = candidate_column.apply(extract_first_name) elections[
Applying custom functions
Applying a custom user-defined function to a DataFrame is a common operation in pandas
.
It allows us to implement custom transformations on the data that are not available in the standard library.
The apply
method can be used in the following way:
- on each value (scalar) of a pd.Series:
df[col].apply(func)
- on each column of a pd.DataFrame:
df.apply(func, axis=0)
- on each row of a pd.DataFrame:
df.apply(func, axis=1)
Each of these operations is distinct in terms of the shape of the output and the arguments that the function being applied takes.
code | x |
argument to the function f |
Expected return type of f |
y |
|
---|---|---|---|---|---|
1 |
x=df[col] y=x.apply(f) |
pd.Series of length \(n\) |
scalar |
scalar |
pd.Series of length \(n\) |
2 | y=x.apply(f,axis=0) |
pd.DataFrame of shape \((n, d)\) |
pd.Series of length \(n\) | scalar | pd.Series of length \(d\) |
3 | y=x.apply(f,axis=1) |
pd.DataFrame of shape \((n, d)\) |
pd.Series of length \(d\) | scalar | pd.Series of length \(n\) |
1. Applying a function to each value of a pd.Series
In our elections dataset, we have a column Candidate
that contains the name of the candidate that each vote was cast for.
Let’s say we wanted to extract the first name of each candidate. We can use the apply
method to apply a custom function that extracts the first name from a string.
Note that the function extract_first_name
takes a scalar input (a string): Candidate
and returns a scalar output (a string): first_name
.
Our function extract_first_name
is called on each value of the Candidate
column so \(n\) times where \(n\) is the number of rows in the DataFrame.
The function extract_first_name
returns back a scalar value for each input value. All together, we get a pd.Series of length \(n\) where each value is the first name of the candidate.
2. Applying a function to each column of a pd.DataFrame (axis=0
)
data.apply(f, axis=0)
applies the function f
to each column of the DataFrame data
.
For example, if we wanted to find the number of unique values in each column of a DataFrame data
, we could use the following code:
def count_unique(col):
return len(set(col))
apply(count_unique, axis="index") # function is passed an individual column elections.
3. Applying a function to each row of a pd.DataFrame (axis=1
)
data.apply(f, axis=1)
applies the function f
to each row of the DataFrame data
.
For instance, let’s say we wanted to count the total number of voters in an election.
We can use .apply
to answer that question using the following formula:
\[ \text{total} \times \frac{\%}{100} = \text{Popular vote} \]
def compute_total(row):
return int(row['Popular vote']*100/row['%'])
apply(compute_total, axis=1) elections.
0 264413
1 264412
2 1143702
3 1143703
4 1287655
...
177 135720167
178 158383403
179 158383403
180 158383401
181 158383402
Length: 182, dtype: int64
Anonymous functions
Anonymous functions are functions that are defined without a name.
In Python, we use the lambda
keyword to create anonymous functions.
The syntax of a lambda function is:
lambda arguments: expression
Lambda functions can have any number of arguments but only one expression, which is evaluated and returned, without using the return
keyword.
They are useful when we need a simple function that we will only use once.
For example, let’s say we wanted to add 1 to each element in a DataFrame. We could use an anonymous function to do this:
= lambda x: x * 2
double
5) # returns 10
double(
= double
multiplyby2
5) # returns 10 multiplyby2(
Note that in the above example, double
is not a function name per se. It is a variable that refers to the anonymous function lambda x: x * 2
.
Anonymous functions with .apply
We can use anonymous functions with .apply
to apply a function to each column or row of a DataFrame.
For example, let’s say we wanted to find the number of unique values in each column of a DataFrame data
. We could use the following code:
apply(lambda x: x.nunique()) data.
Let’s say we wanted to count the total number of voters in an election. We can use .apply
to answer that question using the following formula:
'Popular vote'] = data.apply(lambda row: row['total'] * row['%'] / 100, axis=1) data[