Descriptive Statistics

In data science, we often want to compute summary statistics. pandas provides a number of built-in methods for this purpose.

For example, we can use the .mean(), .median() and .std() methods to compute the mean, median, and standard deviation of a column, respectively.

import pandas as pd 

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)

elections.head()
   Year          Candidate                  Party  Popular vote Result          %
0  1824     Andrew Jackson  Democratic-Republican        151271   loss  57.210122
1  1824  John Quincy Adams  Democratic-Republican        113142    win  42.789878
2  1828     Andrew Jackson             Democratic        642806    win  56.203927
3  1828  John Quincy Adams    National Republican        500897   loss  43.796073
4  1832     Andrew Jackson             Democratic        702735    win  54.574789

Central Tendency

The mean, median, and mode are three measures of central tendency. They are used to describe the center of a data set.

Mean

The mean is the average of all the values in a data set. It is calculated by adding up all the values and then dividing by the number of values.

\[ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \]

where \(x_i\) is the \(i\)-th value in the data set and \(n\) is the number of values.

elections['%'].mean(), sum(elections['%'])/len(elections)
(27.470350372043967, 27.470350372043967)

Median

The median is the middle value in a data set when the values are ordered from smallest to largest. If there is an even number of values, the median is the average of the two middle values.

\[ \text{Median} = \begin{cases} x_{(n+1)/2} & \text{if $n$ is odd} \\ \frac{x_{n/2} + x_{n/2+1}}{2} & \text{if $n$ is even} \end{cases} \]

where the values are indexed in ascending order, \(x_{(n+1)/2}\) is the middle value when \(n\) is odd, and \(x_{n/2}\) and \(x_{n/2+1}\) are the two middle values when \(n\) is even.

elections['%'].median(), elections['%'].quantile(0.5)
(37.67789306, 37.67789306)
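
As a quick check of the two cases in the median formula, here is a minimal sketch. It uses two made-up Series (toy values, not taken from the elections data), one with an odd and one with an even number of values.

```python
import pandas as pd

# Toy values chosen for illustration only (not from elections.csv)
odd = pd.Series([7, 1, 5])        # sorted: 1, 5, 7
even = pd.Series([7, 1, 5, 3])    # sorted: 1, 3, 5, 7

odd_median = odd.median()    # the single middle value of the sorted data
even_median = even.median()  # average of the two middle values, (3 + 5) / 2

print(odd_median, even_median)
```

The data does not need to be sorted beforehand; .median() orders the values internally.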

Mode

The mode is the value that appears most frequently in a data set. A data set can have one mode, more than one mode, or no mode at all.

\[ \text{Mode} = \text{value that appears most frequently} \]

elections['Party'].mode()
0    Democratic
Name: Party, dtype: object
elections['Party'].value_counts().idxmax()
'Democratic'
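
The two calls above can disagree when there is a tie. The sketch below uses made-up labels (an assumption for illustration, not the elections data) to show that .mode() returns every tied value, while .value_counts().idxmax() returns a single label.

```python
import pandas as pd

# Toy labels chosen for illustration only
votes = pd.Series(["A", "B", "A", "B", "C"])

modes = votes.mode()                        # all values tied for the highest count
first_max = votes.value_counts().idxmax()   # one label only, even when there is a tie

# When every value is distinct, pandas reports all of them as modes
distinct = pd.Series([1, 2, 3])
distinct_modes = distinct.mode()

print(modes.tolist(), first_max, distinct_modes.tolist())
```

This is why .mode() returns a Series rather than a scalar: a data set may have more than one mode.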

Dispersion

Dispersion refers to the spread of values in a data set. Measures of dispersion include the range, variance, and standard deviation.

Range

The range is the difference between the largest and smallest values in a data set.

\[ \text{Range} = \text{Largest value} - \text{Smallest value} \]
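
One way to compute the range in pandas is to subtract .min() from .max(). The sketch below uses a small made-up Series; the same expression could be applied to a column such as elections['%'].

```python
import pandas as pd

# Toy values chosen for illustration only
values = pd.Series([4.0, 9.5, 1.5, 7.0])

# Range = largest value - smallest value
value_range = values.max() - values.min()

print(value_range)  # 8.0
```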

Variance

The variance is a measure of how spread out the values in a data set are. It is calculated by taking the average of the squared differences between each value and the mean.

\[ \text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \text{Mean})^2}{n} \]

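where \(x_i\) is the \(i\)-th value in the data set, \(\text{Mean}\) is the mean of the data set, and \(n\) is the number of values. This is the population variance. By default, the pandas .var() and .std() methods divide by \(n - 1\) instead (the sample variance, ddof=1); passing ddof=0 gives the formula above.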

Standard Deviation

The standard deviation is the square root of the variance. It is a measure of how spread out the values in a data set are relative to the mean.

\[ \text{Standard Deviation} = \sqrt{\text{Variance}} \]
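
The following sketch computes the variance and standard deviation by hand on a made-up Series and compares them with the pandas .var() and .std() methods. The toy values are an assumption chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Toy values chosen for illustration only; their mean is 5.0
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)
mean = values.mean()

# Population variance: average squared deviation from the mean (divide by n)
pop_var = ((values - mean) ** 2).sum() / n
pop_std = pop_var ** 0.5

# pandas divides by n - 1 by default (ddof=1); ddof=0 divides by n
pandas_pop_var = values.var(ddof=0)
pandas_sample_var = values.var()
pandas_pop_std = values.std(ddof=0)

print(pop_var, pop_std, pandas_sample_var)
```

The sample variance is always slightly larger than the population variance, and the gap shrinks as \(n\) grows.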

Skewness

Skewness is a measure of the asymmetry of a distribution. It can be positive, negative, or zero.

Positive skewness: The distribution is skewed to the right, with a long tail on the right side.

Negative skewness: The distribution is skewed to the left, with a long tail on the left side.

Zero skewness: The distribution is symmetric.

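The skewness of a Series or DataFrame can be computed using the .skew() method. pandas reports a bias-corrected sample estimate, so for small \(n\) its result differs slightly from the population formula below.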

\[ \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \mu)^3}{n \sigma^3} \]

where \(x_i\) is the \(i\)-th value in the data set, \(\mu\) is the mean of the data set, \(n\) is the number of values, and \(\sigma\) is the standard deviation of the data set.
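
The sketch below evaluates the skewness formula directly on a made-up Series with a long right tail and compares it with .skew(). It assumes pandas applies the usual sample-size correction factor \(\sqrt{n(n-1)}/(n-2)\) to the population value, which the final line checks numerically.

```python
import pandas as pd

# Toy values chosen for illustration only; 10.0 creates a long right tail
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(values)
mu = values.mean()
sigma = values.std(ddof=0)  # population standard deviation

# Skewness as in the formula above (no sample-size correction)
formula_skew = ((values - mu) ** 3).sum() / (n * sigma ** 3)

# pandas value, and the formula value rescaled by the assumed correction factor
pandas_skew = values.skew()
adjusted = formula_skew * (n * (n - 1)) ** 0.5 / (n - 2)

print(formula_skew, pandas_skew, adjusted)
```

Both values are positive here, which matches the long tail on the right side.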

The .max() and .min() methods compute the maximum and minimum values of a Series or DataFrame.

elections['%'].max(), elections['%'].min()
(61.34470329, 0.098088334)

The .sum() method computes the sum of all the values in a Series or DataFrame.

The .describe() method computes summary statistics for a Series or DataFrame. For numeric data, it reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

elections['%'].describe()
count    182.000000
mean      27.470350
std       22.968034
min        0.098088
25%        1.219996
50%       37.677893
75%       48.354977
max       61.344703
Name: %, dtype: float64

.describe()

If many statistics are required from a DataFrame (minimum value, maximum value, mean value, etc.), then .describe() can be used to compute all of them at once.

elections.describe()
              Year  Popular vote           %
count   182.000000  1.820000e+02  182.000000
mean   1934.087912  1.235364e+07   27.470350
std      57.048908  1.907715e+07   22.968034
min    1824.000000  1.007150e+05    0.098088
25%    1889.000000  3.876395e+05    1.219996
50%    1936.000000  1.709375e+06   37.677893
75%    1988.000000  1.897775e+07   48.354977
max    2020.000000  8.126892e+07   61.344703

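A different set of statistics is reported when .describe() is called on non-numeric data. For a Series of strings such as the Party column, it reports the count, the number of unique values, the most frequent value (top), and that value's frequency (freq). A numeric Series reports the same statistics as a numeric DataFrame column.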

elections["Party"].describe()
count            182
unique            36
top       Democratic
freq              47
Name: Party, dtype: object
elections["Popular vote"].describe().astype(int)
count         182
mean     12353635
std      19077149
min        100715
25%        387639
50%       1709375
75%      18977751
max      81268924
Name: Popular vote, dtype: int64
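
Covariance

Covariance measures how two variables vary together. It is calculated by taking the average of the products of each pair of deviations from the respective means.

\[ \text{Covariance} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n} \]

where \(\mu_x\) and \(\mu_y\) are the means of \(x\) and \(y\), and \(n\) is the number of values.

x = elections['%']
y = elections['Popular vote']

# Population covariance: average product of deviations from the two means
cov = sum((x-x.mean()) * (y-y.mean())) / len(x)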

round(cov)
243614836
x.cov(y), round(x.cov(y, ddof=0))
(244960774.11030602, 243614836)

By default, .cov() divides by \(n - 1\) (ddof=1), so x.cov(y) is slightly larger than the value computed by hand. Passing ddof=0 divides by \(n\) and reproduces the hand-computed result.

Percentile and Quantile

Percentiles and quantiles are measures of position in a data set. They divide the data set into equal parts.

Percentile

A percentile is a value below which a given percentage of the data falls. For example, the 25th percentile is the value below which 25% of the data falls.

Quantile

A quantile is the same idea expressed as a fraction between 0 and 1 rather than as a percentage. For example, the 0.25 quantile is the value below which 25% of the data falls, which is the 25th percentile.

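The .quantile() method can be used to compute the quantiles of a Series or DataFrame. On a DataFrame that contains non-numeric columns, recent pandas versions require numeric_only=True so that those columns are skipped.

elections.quantile(0.25, numeric_only=True)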
Year              1889.000000
Popular vote    387639.500000
%                    1.219996
Name: 0.25, dtype: float64
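
To illustrate the relationship between percentiles and quantiles, the sketch below computes the 25th percentile with NumPy and the 0.25 quantile with pandas on a made-up Series. The toy values are an assumption for illustration; both calls use linear interpolation by default.

```python
import numpy as np
import pandas as pd

# Toy values chosen for illustration only
values = pd.Series([10, 20, 30, 40, 50])

q25 = values.quantile(0.25)        # quantile: fraction between 0 and 1
p25 = np.percentile(values, 25)    # percentile: percentage between 0 and 100

# Several quantiles at once; the 0.5 quantile is the median
quartiles = values.quantile([0.25, 0.5, 0.75])

print(q25, p25, quartiles.tolist())
```

Passing a list to .quantile() returns a Series indexed by the requested fractions, which is how .describe() obtains its 25%, 50%, and 75% rows.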