Histogram (Distribution of Continuous Data)

A histogram is a graphical representation that organizes a group of data points into ranges.

These ranges are called “bins”. By default, all bins have the same width, obtained by dividing the range of the data (maximum minus minimum) by the number of bins:

\[ \text{Bin Width} = \frac{\text{Max Value} - \text{Min Value}}{\text{Number of Bins}} \]

Each bin covers the interval:

\[ [\color{yellow}{\text{Bin Start}}, \color{red}{\text{Bin End}}) = [\color{yellow}{\text{Min Value} + i \times \text{Bin Width}}, \color{red}{\text{Min Value} + (i + 1) \times \text{Bin Width}}) \]

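where \(i\) is the bin index, running from 0 to \(\text{Number of Bins} - 1\). The intervals are half-open, so a value that falls exactly on an edge belongs to the bin that starts there. Plotting libraries such as matplotlib close the final bin on the right so that the maximum value is still counted.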

Programmatically, the bin edges can be computed with the following Python code:


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

number_of_bins = 4

max_value = max(data)
min_value = min(data)

# Parentheses are required: division binds tighter than subtraction.
bin_width = (max_value - min_value) / number_of_bins

# Print the half-open interval [bin_start, bin_end) covered by each bin.
for i in range(number_of_bins):
    bin_start = min_value + i * bin_width
    bin_end = min_value + (i + 1) * bin_width
    print(f"Bin {i}: [{bin_start}, {bin_end})")

A histogram visualizes the distribution of numerical data by showing how many data points fall within each range (or “bin”).
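The counting step can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the function name `histogram_counts` is hypothetical, the data is assumed to contain at least two distinct values (otherwise the bin width is zero), and the maximum value is assumed to belong to the last bin, following the common convention of closing the final bin on the right.

```python
def histogram_counts(data, number_of_bins):
    """Count how many values fall into each equal-width bin (hypothetical helper)."""
    min_value, max_value = min(data), max(data)
    bin_width = (max_value - min_value) / number_of_bins

    counts = [0] * number_of_bins
    for value in data:
        # Index of the half-open bin [start, end) that contains `value`.
        i = int((value - min_value) / bin_width)
        # The maximum lands exactly on the final edge; assign it to the last bin.
        i = min(i, number_of_bins - 1)
        counts[i] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counts = histogram_counts(data, number_of_bins=4)
print(counts)  # → [1, 2, 3, 4]
```

Each count is the height of one bar in the histogram. Because the index is computed with floating-point division, values lying extremely close to a bin edge may be assigned differently than an exact edge comparison would assign them.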

By default, histograms display the frequency (count) of data points in each bin. However, they can also be configured to show relative frequency (proportion of total data points) or density (frequency per unit on the x-axis).
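The three scalings differ only in how the raw counts are normalized. The sketch below assumes the ten-value sample `[1, 2, 2, 3, 3, 3, 4, 4, 4, 4]` split into 4 equal-width bins, and takes "density" to mean the normalization in which the bar areas sum to 1 (the convention matplotlib uses for `density=True`); the variable names are illustrative.

```python
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counts = [1, 2, 3, 4]  # points per bin for 4 equal-width bins over [1, 4]
bin_width = (max(data) - min(data)) / len(counts)

total = sum(counts)

# Relative frequency: share of all data points in each bin (heights sum to 1).
relative_frequency = [c / total for c in counts]

# Density: relative frequency per unit of x (bar areas sum to 1).
density = [c / (total * bin_width) for c in counts]

area = sum(d * bin_width for d in density)
print(relative_frequency)
print(density)
print(area)
```

Relative frequency makes histograms of datasets with different sizes comparable, while density additionally keeps them comparable when the bin widths differ.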

from matplotlib import pyplot as plt
import pandas as pd 

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)
elections.head()
   Year          Candidate                  Party  Popular vote  Result          %
0  1824     Andrew Jackson  Democratic-Republican        151271    loss  57.210122
1  1824  John Quincy Adams  Democratic-Republican        113142     win  42.789878
2  1828     Andrew Jackson             Democratic        642806     win  56.203927
3  1828  John Quincy Adams    National Republican        500897    loss  43.796073
4  1832     Andrew Jackson             Democratic        702735     win  54.574789
winners = elections[elections['Result'] == 'win']

fig, ax = plt.subplots()

ax.hist(winners['%'], bins=10, edgecolor='black');
ax.set_title('Winning Percentages in Elections')
ax.set_xlabel('Winning Percentage')
ax.set_ylabel('Number of Elections');

2D Histogram

A 2D histogram is an extension of the traditional histogram that represents the joint distribution of two continuous variables.

In a 2D histogram, the data is divided into bins along both the x-axis and y-axis, creating a grid of rectangular bins. Each bin counts the number of data points that fall within its range for both variables.

The height (or color intensity) of each bin represents the frequency of data points that fall within that bin’s range for both variables.

For example, consider a dataset with two continuous variables, X and Y. A 2D histogram would divide the range of X into several bins along the x-axis and the range of Y into several bins along the y-axis. Each bin in the resulting grid would then count how many data points fall within the corresponding ranges of X and Y.

The resulting 2D histogram can be visualized using a heatmap, where the color intensity of each bin indicates the frequency of data points in that bin. This allows for easy identification of patterns, correlations, and distributions between the two variables.
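The grid counting described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function name `histogram2d_counts` and the sample points are made up, the same number of bins is used on both axes, each variable takes at least two distinct values, and points on the upper edge of either axis are assigned to the last bin.

```python
def histogram2d_counts(xs, ys, bins):
    """Count points per cell of a bins x bins grid (hypothetical helper).

    grid[row][col] holds the count for y-bin `row` and x-bin `col`.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_width = (x_max - x_min) / bins
    y_width = (y_max - y_min) / bins

    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        # Bin index along each axis, clamped so the maximum falls in the last bin.
        col = min(int((x - x_min) / x_width), bins - 1)
        row = min(int((y - y_min) / y_width), bins - 1)
        grid[row][col] += 1
    return grid


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.0]
ys = [0.0, 0.2, 1.0, 1.9, 2.0, 0.1]
grid = histogram2d_counts(xs, ys, bins=2)
print(grid)  # → [[2, 1], [0, 3]]
```

Each entry of `grid` corresponds to one cell of the heatmap; the larger the count, the more intense the color of that cell.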

from matplotlib import pyplot as plt
import pandas as pd

url  = 'https://raw.githubusercontent.com/fahadsultan/csc343/refs/heads/main/data/uscities.csv'
data = pd.read_csv(url)
# Keep only the contiguous US: drop Hawaii, Alaska, and Puerto Rico.
us_mainland = data[~data['state_id'].isin(['HI', 'AK', 'PR'])]

fig, ax = plt.subplots(2, 1, figsize=(7, 12), sharex=True, sharey=True)

ax[0].scatter(us_mainland['lng'], us_mainland['lat'], s=1)
ax[0].set_title('Scatter Plot of US Cities')
ax[0].set_xlabel('Longitude')
ax[0].set_ylabel('Latitude')

a = ax[1].hist2d(us_mainland['lng'], us_mainland['lat'], bins=20, cmap='Blues')
ax[1].set_title('2D Histogram of US Cities')
ax[1].set_xlabel('Longitude')
ax[1].set_ylabel('Latitude')
fig.colorbar(a[3], ax=ax[1])

plt.show()

In the code above, the bins parameter is set to 20, which divides the data into 20 bins along each of the x-axis and y-axis, giving a 20 × 20 grid. The cmap parameter is set to ‘Blues’, which specifies the color map used to represent the frequency of data points in each bin. The fig.colorbar() call adds a color bar to the plot, which indicates the frequency scale.

ax.hist2d() returns a tuple of four items: the 2D array of counts, the bin edges along x, the bin edges along y, and a QuadMesh object representing the drawn histogram. fig.colorbar() takes that QuadMesh (a[3] in the code) as its first argument. The color bar provides a reference for interpreting the color intensity in the histogram, allowing viewers to read off the frequency of data points in each bin from the color scale.