Box Plots

When visualizing the distribution of a continuous variable across multiple categories, bar plots can be misleading. A bar shows only one summary value per category (typically the mean), so it fails to convey the spread of the data and can hide important information about the distribution.

Box plots display distributions using information about quartiles.

Quartiles are the three values (Q1, Q2 and Q3) that split the sorted data into four parts, each containing 25% of the observations.

We say that:

First quartile (Q1)

Represents the 25th percentile – 25% of the data lies below the first quartile.

In a box plot, the lower extent of the box lies at Q1.

Second quartile (Q2) aka Median

Represents the 50th percentile, also known as the median – 50% of the data lies below the second quartile.

In a box plot, the horizontal line in the middle of the box corresponds to Q2 (equivalently, the median).

Third quartile (Q3)

Represents the 75th percentile – 75% of the data lies below the third quartile.

In a box plot, the upper extent of the box lies at Q3.
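The three quartiles can be computed directly to see what the box encodes. The sketch below uses Python's standard `statistics` module on a small, made-up sample (the values in `data` are hypothetical, chosen only for illustration). Different libraries interpolate quartiles slightly differently; `method="inclusive"` is used here on the assumption that it matches the linear interpolation matplotlib relies on when drawing a box.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" interpolates linearly between the sorted values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("Q1 (bottom of box):", q1)   # 4.25
print("Q2 (median line):  ", q2)   # 7.5
print("Q3 (top of box):   ", q3)   # 9.75
```

Note that Q2 is the same value `statistics.median(data)` would return.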

Inter-Quartile Range (IQR) and Whiskers

The Inter-Quartile Range (IQR) measures the spread of the middle 50% of the distribution, calculated as the (\(3^{rd}\) Quartile \(-\) \(1^{st}\) Quartile). In a box plot, the IQR is the height of the box.

\[ IQR = Q3 - Q1 \]
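As a quick numeric illustration of the formula, the sketch below computes the IQR for a small, made-up sample (the values in `data` are hypothetical) and compares it with the full range of the data.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]

# Only the first and third quartiles are needed for the IQR.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                        # spread of the middle 50%
full_range = max(data) - min(data)   # spread of all the data

print("IQR:  ", iqr)          # 5.5
print("Range:", full_range)   # 28
```

The single large value (30) stretches the full range to 28, while the IQR reflects only the middle half of the data.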

The whiskers of a box plot extend from the box toward a lower limit at [\(1^{st}\) Quartile \(-\) (\(1.5 \times\) IQR)] and an upper limit at [\(3^{rd}\) Quartile \(+\) (\(1.5 \times\) IQR)]. These limits bound the range of “normal” data (the points excluding outliers). Many plotting libraries, including matplotlib by default, end each whisker at the most extreme data point that still lies within its limit, rather than at the limit itself. Consequently, the outliers are the data points that fall beyond the whiskers, that is, further than (\(1.5 \times\) IQR) below Q1 or above Q3.

\[ \text{Lower Whisker Limit} = Q1 - (1.5 \times IQR) \]

\[ \text{Upper Whisker Limit} = Q3 + (1.5 \times IQR) \]
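The two formulas above can be evaluated by hand for a small sample. The sketch below does so for made-up values (`data` is hypothetical), and also finds the most extreme observations that still fall inside the \(1.5 \times\) IQR limits, which is where matplotlib's default settings end the drawn whiskers.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# The 1.5 x IQR limits from the formulas above.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# The most extreme observations that are still inside the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
lower_whisker_end = min(inside)
upper_whisker_end = max(inside)

print("Limits:      ", lower_limit, upper_limit)              # -4.0 18.0
print("Whisker ends:", lower_whisker_end, upper_whisker_end)  # 2 12
```

Here the lower limit (-4.0) is below every observation, so the lower whisker simply stops at the minimum of the data.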

Outliers

An outlier is a data point that lies outside the overall pattern of the distribution. In a box plot, outliers are represented as individual points that fall beyond the whiskers.

\[ \text{Outlier} < Q1 - (1.5 \times IQR) \]

\[ \text{Outlier} > Q3 + (1.5 \times IQR) \]
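The two inequalities translate directly into a filter. A minimal sketch on made-up values (`data` is hypothetical) that flags the points a box plot would draw individually:

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# A point is an outlier if it satisfies either inequality above.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("Outliers:", outliers)   # [30]
```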

Example

import pandas as pd 

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)

elections.head()

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122
1	1824	John Quincy Adams	Democratic-Republican	113142	win	42.789878
2	1828	Andrew Jackson	Democratic	642806	win	56.203927
3	1828	John Quincy Adams	National Republican	500897	loss	43.796073
4	1832	Andrew Jackson	Democratic	702735	win	54.574789

from matplotlib import pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

axs[0].boxplot(elections["Popular vote"], labels=["Popular vote"])
axs[0].set_title("Boxplot of Total Votes")
axs[0].set_ylabel("Total Votes")
axs[0].set_xlabel("");

axs[1].hist(elections["Popular vote"], bins=20, edgecolor='black')
axs[1].set_title("Histogram of Total Votes")
axs[1].set_xlabel("Total Votes")
axs[1].set_ylabel("Frequency");


# Box plot for each Result 

fig, ax = plt.subplots(1, 2, figsize=(8, 4))

win = elections[elections["Result"] == "win"]
loss = elections[elections["Result"] == "loss"]

ax[0].boxplot([win["%"], loss["%"]],
              labels=["Won", "Lost"],
              showfliers=False);  # hide the individual outlier points

ax[0].set_title("Boxplot of % by Election Result")
ax[0].set_ylabel("% of Votes")


ax[1].hist(win["%"], alpha=0.7, label="Won", bins=10);
ax[1].hist(loss["%"], alpha=0.7, label="Lost", bins=10);
ax[1].legend();

ax[1].set_xlabel("% of Votes")
ax[1].set_title("Histogram of % by Election Result")
ax[1].set_ylabel("Frequency")
plt.show()
