from matplotlib import pyplot as plt
import pandas as pd
= "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"
url
= pd.read_csv(url) elections
Histogram (Distribution of Continuous Data)
A histogram is a graphical representation that organizes a group of data points into ranges.
These ranges are typically called “bins”. By default, bins are of equal size and are computed based on the difference between the maximum and minimum values in the dataset.
i.e.
\[ \text{Bin Width} = \frac{\text{Max Value} - \text{Min Value}}{\text{Number of Bins}} \]
Each bin:
\[ [\text{\color{yellow}{Bin Start}}, \text{\color{red}{Bin End}}) = [\color{yellow}{\text{Min Value} + i \times \text{Bin Width}}, \color{red}{\text{Min Value} + (i + 1) \times \text{Bin Width}}) \]
where \(i\) is the bin index starting from 0.
Alternatively, progrmmatically, it can be calculated using the following pseudocode:
= [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
data
= 4
number_of_bins
= max(data)
max_value = min(data)
min_value
= max_value - min_value / number_of_bins
bin_width
for i in range(number_of_bins):
= min_value + i * bin_width
bin_start = min_value + (i + 1) * bin_width
bin_end print(f"Bin {i}: [{bin_start}, {bin_end})")
It is used to visualize the distribution of numerical data by showing the frequency of data points that fall within each range (or “bin”).
By default, histograms display the frequency (count) of data points in each bin. However, they can also be configured to show relative frequency (proportion of total data points) or density (frequency per unit on the x-axis).
elections.head()
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
= elections[elections['Result'] == 'win']
winners
= plt.subplots()
fig, ax
'%'], bins=10, edgecolor='black');
ax.hist(winners['Winning Percentages in Elections')
ax.set_title('Winning Percentage')
ax.set_xlabel('Number of Elections'); ax.set_ylabel(
2D Histogram
A 2D histogram is an extension of the traditional histogram that represents the joint distribution of two continuous variables.
In a 2D histogram, the data is divided into bins along both the x-axis and y-axis, creating a grid of rectangular bins. Each bin counts the number of data points that fall within its range for both variables.
The height (or color intensity) of each bin represents the frequency of data points that fall within that bin’s range for both variables.
For example, consider a dataset with two continuous variables, X and Y. A 2D histogram would divide the range of X into several bins along the x-axis and the range of Y into several bins along the y-axis. Each bin in the resulting grid would then count how many data points fall within the corresponding ranges of X and Y.
The resulting 2D histogram can be visualized using a heatmap, where the color intensity of each bin indicates the frequency of data points in that bin. This allows for easy identification of patterns, correlations, and distributions between the two variables.
from matplotlib import pyplot as plt
import pandas as pd
= 'https://raw.githubusercontent.com/fahadsultan/csc343/refs/heads/main/data/uscities.csv'
url = pd.read_csv(url)
data = data[(data['state_id'] != 'HI') & \
us_mainland 'state_id'] != 'AK') & \
(data['state_id'] != 'PR')]
(data[
= plt.subplots(2, 1, figsize=(7, 12), sharex=True, sharey=True)
fig, ax
0].scatter(us_mainland['lng'], us_mainland['lat'], s=1)
ax[0].set_title('Scatter Plot of US Cities')
ax[0].set_xlabel('Longitude')
ax[0].set_ylabel('Latitude')
ax[
= ax[1].hist2d(us_mainland['lng'], us_mainland['lat'], bins=20, cmap='Blues')
a 1].set_title('2D Histogram of US Cities')
ax[1].set_xlabel('Longitude')
ax[1].set_ylabel('Latitude')
ax[3], ax=ax[1])
fig.colorbar(a[
plt.show()
In the code above, the bins
parameter is set to 30, which means that the data will be divided into 30 bins along both the x-axis and y-axis. The cmap
parameter is set to ‘Reds’, which specifies the color map used to represent the frequency of data points in each bin. The plt.colorbar()
function adds a color bar to the plot, which indicates the frequency scale.
plt.colorbar()
takes as an argument the result of the plt.hist2d()
function, which is a QuadMesh
object representing the 2D histogram. The color bar provides a reference for interpreting the color intensity in the histogram, allowing viewers to understand the frequency of data points in each bin based on the color scale.