# import libraries
from matplotlib import pyplot as plt
import pandas as pd

plt.style.use('dark_background')

# load us cities data
url = 'https://raw.githubusercontent.com/fahadsultan/csc343/refs/heads/main/data/uscities.csv'
data = pd.read_csv(url)

# keep only the mainland (drop Hawaii, Alaska and Puerto Rico)
us_mainland = data[(data['state_id'] != 'HI') & \
                   (data['state_id'] != 'AK') & \
                   (data['state_id'] != 'PR')]

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot
ax.scatter(us_mainland['lng'], us_mainland['lat'], s=1);

# setting labels and title
ax.set_xlabel('Longitude');
ax.set_ylabel('Latitude');
ax.set_title('Cities in the US Mainland');
Scatter Plots (\(\geq\) 2 numeric variables)
Scatter plots are a great way to give you a sense of trends, concentrations, and outliers. This notebook will show you how to create scatter plots using Matplotlib.
A scatter plot uses dots to represent values for two (or more) different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point.
Scatter plots are used to observe the relationship between variables. They are sometimes called correlation plots because they show how two variables are correlated.
Creating a Scatter Plot
To create a scatter plot, we can use the scatter() function from the Matplotlib library. The scatter() function takes two required arguments: the x-axis values and the y-axis values.
The code at the top of this notebook is an example of a simple scatter plot created with Matplotlib: it plots the longitude and latitude of cities in the US mainland.
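As a smaller illustration of the two required arguments, here is a minimal sketch that does not depend on the cities data. The arrays x and y are synthetic values made up for this example.

```python
import numpy as np
from matplotlib import pyplot as plt

# synthetic data (made up for illustration): y roughly follows 2 * x
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot: x-axis values first, y-axis values second
points = ax.scatter(x, y)

# setting labels and title
ax.set_xlabel('x');
ax.set_ylabel('y');
ax.set_title('A minimal scatter plot');
```

Each of the 200 (x, y) pairs becomes one dot, so the upward trend in the synthetic data shows up as a rising cloud of points.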
Marker Size
The size of the markers can be adjusted using the s parameter, which sets the marker area in points squared. The default value is rcParams['lines.markersize'] ** 2, which is s=36 under Matplotlib's default settings.
# creating figure and axis
fig, ax = plt.subplots(1, 2, figsize=(10, 3))

# scatter plot
ax[0].scatter(us_mainland['lng'], us_mainland['lat'], s=0.1);
ax[1].scatter(us_mainland['lng'], us_mainland['lat'], s=5);

# setting labels and title
ax[0].set_title('s=0.1');
ax[1].set_title('s=5');
The s parameter accepts a scalar or an array of the same length as the number of data points.
# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scale populations down to reasonable marker sizes
scaling_factor = 1/50_000

# scatter plot
ax.scatter(us_mainland['lng'], us_mainland['lat'], s=us_mainland['population']*scaling_factor);

# setting labels and title
ax.set_title('Cities in the US Mainland, Size proportional to Population');
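A fixed scaling factor has to be tuned by hand for each dataset. One possible alternative, sketched below, is to rescale the values linearly into a chosen range of marker areas. The population values and the names min_size and max_size are hypothetical and chosen only for this illustration. Note that with this rescaling the marker area is no longer strictly proportional to the value, because the smallest value maps to min_size rather than to zero.

```python
import numpy as np
from matplotlib import pyplot as plt

# hypothetical populations, made up for illustration
population = np.array([5_000, 20_000, 150_000, 1_000_000, 8_000_000])
x = np.arange(len(population))
y = np.zeros(len(population))

# min-max rescaling: smallest value -> min_size, largest -> max_size
# (both are marker areas in points squared)
min_size, max_size = 10, 500
scaled = (population - population.min()) / (population.max() - population.min())
sizes = min_size + scaled * (max_size - min_size)

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 2))

# scatter plot with one size per data point
points = ax.scatter(x, y, s=sizes)
ax.set_yticks([])
ax.set_title('Marker area rescaled to a fixed range');
```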
Marker Color
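The color of the markers can be adjusted using the c parameter. The c parameter accepts a single color or an array of the same length as the number of data points; an array of numbers is mapped to colors through the colormap given by cmap. If no color is specified, Matplotlib uses the next color in the axes' color cycle, which is blue for the first plot under the default style.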
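To see what happens when no color is given, the sketch below calls scatter() twice on the same axes without a c or color argument. The data is synthetic and made up for this example. Under Matplotlib's non-classic styles, each call takes the next color from the axes' color cycle, referred to as 'C0', 'C1', and so on.

```python
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import to_rgba

# synthetic data (made up for illustration): two separate clouds of points
rng = np.random.default_rng(seed=0)

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# no c or color argument: colors come from the color cycle
first = ax.scatter(rng.normal(0, 1, 50), rng.normal(0, 1, 50), label='first call')
second = ax.scatter(rng.normal(3, 1, 50), rng.normal(3, 1, 50), label='second call')

ax.legend();
```

The two clouds are drawn in different colors even though no color was requested, which is why passing c explicitly is mainly useful when the color itself should carry meaning.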
# cities in South Carolina and Georgia
sc = us_mainland[us_mainland['state_id'] == 'SC']
ga = us_mainland[us_mainland['state_id'] == 'GA']

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot: one color per state
ax.scatter(sc['lng'], sc['lat'], c='red', label='South Carolina');
ax.scatter(ga['lng'], ga['lat'], c='blue', label='Georgia');

# setting labels and title
ax.set_title('Cities in South Carolina and Georgia');
ax.legend(fontsize=12);

# cities in South Carolina and Georgia
sc = us_mainland[us_mainland['state_id'] == 'SC']
ga = us_mainland[us_mainland['state_id'] == 'GA']

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot: color each city by its density using the 'Reds' colormap
sc_plt = ax.scatter(sc['lng'], sc['lat'], c=sc['density'], cmap='Reds');

# setting labels and title
ax.set_title('Cities in South Carolina, color proportional to Density');
plt.colorbar(sc_plt, label='Density');
Marker Shape
The shape of the markers can be adjusted using the marker parameter. The marker parameter accepts a string that specifies the shape of the markers. The default value is marker='o' (circle).
Here are some of the marker shapes that you can use:
- 'o' - Circle
- 's' - Square
- '^' - Triangle
- 'v' - Inverted Triangle
- 'x' - X
- '+' - Plus
- '*' - Star
- 'D' - Diamond
- 'd' - Thin Diamond
- 'p' - Pentagon
- 'h' - Hexagon
- 'H' - Rotated Hexagon
- '<' - Left Triangle
- '>' - Right Triangle
# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot: one marker shape per state
ax.scatter(sc['lng'], sc['lat'], label='South Carolina', color='red', marker='x');
ax.scatter(ga['lng'], ga['lat'], label='Georgia', color='blue', marker='^');

ax.legend(fontsize=12);
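To compare the marker shapes listed above side by side, the following sketch draws one point per marker string. The dictionary name markers and the layout are choices made only for this illustration.

```python
from matplotlib import pyplot as plt

# marker strings from the list above, mapped to their names
markers = {
    'o': 'Circle', 's': 'Square', '^': 'Triangle', 'v': 'Inverted Triangle',
    'x': 'X', '+': 'Plus', '*': 'Star', 'D': 'Diamond', 'd': 'Thin Diamond',
    'p': 'Pentagon', 'h': 'Hexagon', 'H': 'Rotated Hexagon',
    '<': 'Left Triangle', '>': 'Right Triangle',
}

# creating figure and axis
fig, ax = plt.subplots(figsize=(10, 2))

# one point per marker shape, placed left to right
for i, (marker, name) in enumerate(markers.items()):
    ax.scatter(i, 0, marker=marker, s=150, label=name)

# label each position with the marker string used to draw it
ax.set_xticks(range(len(markers)))
ax.set_xticklabels(list(markers.keys()))
ax.set_yticks([])
ax.set_title('Marker shapes');
```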
Pandas and Seaborn
Pandas and Seaborn are great libraries for data manipulation and data visualization, respectively. Pandas is used to load and manipulate data, while Seaborn is used to visualize data. Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.
Creating a scatter plot using Seaborn or Pandas usually requires fewer lines of code than using Matplotlib directly.
# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot using the pandas DataFrame.plot interface
sc.plot(kind='scatter', \
        x='lng', \
        y='lat', \
        s=sc['population']/1000, \
        ax=ax, \
        color='red', \
        label='South Carolina');
Scatter plots can also be created using the scatterplot() function from the Seaborn library. The scatterplot() function is typically given a DataFrame through the data parameter, together with the names of the columns to plot on the x-axis and the y-axis.
Note that when using Seaborn or Pandas with column names, the axis labels are automatically set to the column names of the DataFrame.
import seaborn as sns

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot: marker size from population, color from state
sns.scatterplot(data=us_mainland, \
                x='lng', \
                y='lat', \
                size='population', \
                sizes=(1, 1000), \
                hue='state_id', \
                ax=ax, \
                legend=False);
Please don’t
Overplot
Overplotting occurs when two or more data points are plotted on top of each other, which can make it difficult to see the true distribution of the data. Overplotting can be reduced by adjusting the transparency of the markers using the alpha parameter, which accepts a scalar between 0 (fully transparent) and 1 (fully opaque). By default, markers are fully opaque, which is equivalent to alpha=1.
# load loan approval data
data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/refs/heads/main/data/loan_approval_dataset.csv')

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot with fully opaque markers
ax.scatter(data['cibil_score'], data['income_annum']);

# setting labels
ax.set_xlabel('Credit Score');
ax.set_ylabel('Annual Income');
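To illustrate the effect of alpha, the sketch below plots the same points twice, once fully opaque and once mostly transparent. The data is synthetic (20,000 normally distributed points made up for this example) and the value alpha=0.05 is only one reasonable choice.

```python
import numpy as np
from matplotlib import pyplot as plt

# synthetic data (made up for illustration): a dense cloud of points
rng = np.random.default_rng(seed=0)
x = rng.normal(size=20_000)
y = rng.normal(size=20_000)

# creating figure and axes
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# left: opaque markers merge into a solid blob
opaque = ax[0].scatter(x, y, s=5)
ax[0].set_title('alpha not set (opaque)')

# right: transparent markers reveal where the points are concentrated
faint = ax[1].scatter(x, y, s=5, alpha=0.05)
ax[1].set_title('alpha=0.05');
```

With transparency, regions where many points overlap appear darker, so the dense center of the cloud becomes distinguishable from its sparse edges.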
Use Scatter for Categorical Data
Scatter plots are used to show the relationship between two numeric variables. If you have categorical data, you should use a different type of plot, such as a bar plot or a box plot.
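# creating figure and axis
fig, ax = plt.subplots()

# scatter plot with a categorical variable on the x-axis
ax.scatter(data['loan_term'], data['loan_amount'], alpha=0.5);

# setting labels
ax.set_ylabel('Loan Amount');
ax.set_xlabel('Loan Term');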
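As a sketch of the alternative suggested above, the following example draws a box plot with one box per category. It uses synthetic data made up for this illustration rather than the loan dataset; the names terms and amounts are hypothetical stand-ins for a categorical column and a numeric column.

```python
import numpy as np
from matplotlib import pyplot as plt

# synthetic data (made up for illustration):
# a categorical variable with 5 levels and a numeric variable
rng = np.random.default_rng(seed=0)
terms = rng.choice([2, 4, 6, 8, 10], size=500)
amounts = rng.normal(loc=100 + 5 * terms, scale=20)

# group the numeric values by category
unique_terms = np.unique(terms)
groups = [amounts[terms == t] for t in unique_terms]

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# box plot: one box per category, summarizing its distribution
result = ax.boxplot(groups)

# setting labels
ax.set_xticklabels([str(t) for t in unique_terms])
ax.set_xlabel('Term (category)')
ax.set_ylabel('Amount');
```

Each box summarizes the median and spread of the numeric variable within one category, which is hard to read from columns of overlapping dots.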