Scatter Plots (\(\geq\) 2 numeric variables)

Scatter plots give a quick sense of trends, concentrations, and outliers in data. This notebook shows how to create them using Matplotlib.

A scatter plot uses dots to represent values for two (or more) different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

Use a scatter plot when you want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they make it easy to see whether, and how strongly, two variables are correlated.

Creating a Scatter Plot

To create a scatter plot, we can use the scatter() function from the Matplotlib library. At minimum, scatter() takes two arguments: the x-axis values and the y-axis values. Optional parameters such as s, c, and marker control how the markers look.

Here is an example of how to create a simple scatter plot using the Matplotlib library:

# import libraries
from matplotlib import pyplot as plt 
import pandas as pd

plt.style.use('dark_background')

# load us cities data
url  = 'https://raw.githubusercontent.com/fahadsultan/csc343/refs/heads/main/data/uscities.csv'
data = pd.read_csv(url)
# keep only the contiguous US: drop Hawaii, Alaska, and Puerto Rico
us_mainland = data[(data['state_id'] != 'HI') &
                   (data['state_id'] != 'AK') &
                   (data['state_id'] != 'PR')]

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot
ax.scatter(us_mainland['lng'], us_mainland['lat'], s=1);

# setting labels and title
ax.set_xlabel('Longitude');
ax.set_ylabel('Latitude');
ax.set_title('Cities in the US Mainland');

Marker Size

The size of the markers can be adjusted using the s parameter, which sets the marker area in points squared. The default value is rcParams['lines.markersize'] ** 2, which is 36 under Matplotlib's default settings.

# creating figure and axis
fig, ax = plt.subplots(1, 2, figsize=(10, 3))

# scatter plot
ax[0].scatter(us_mainland['lng'], us_mainland['lat'], s=0.1);
ax[1].scatter(us_mainland['lng'], us_mainland['lat'], s=5);

# setting labels and title
ax[0].set_title('s=0.1');
ax[1].set_title('s=5');
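Because s is an area rather than a diameter, quadrupling s only doubles the width of a marker. The following is a minimal sketch of that relationship on made-up points; the variable names are illustrative and not part of the original notebook.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2]

fig, ax = plt.subplots(figsize=(6, 2))

# s omitted: Matplotlib falls back to rcParams['lines.markersize'] ** 2
default_pts = ax.scatter(x, [0, 0, 0])

# s is an area, so 4x the area draws markers with 2x the diameter
base_area = plt.rcParams['lines.markersize'] ** 2
big_pts = ax.scatter(x, [1, 1, 1], s=4 * base_area)

# read the sizes back from the returned collections
default_size = default_pts.get_sizes()[0]
big_size = big_pts.get_sizes()[0]
diameter_ratio = (big_size / default_size) ** 0.5
```

Here diameter_ratio is the square root of the ratio of the two areas, which is why s values often need to grow faster than expected to produce a visible difference.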

The s parameter accepts a scalar or an array of the same length as the number of data points.

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

scaling_factor = 1/50_000
# scatter plot
ax.scatter(us_mainland['lng'], us_mainland['lat'], s=us_mainland['population']*scaling_factor);

# setting labels and title
ax.set_title('Cities in the US Mainland, Size proportional to Population');
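A fixed scaling factor such as 1/50_000 has to be tuned by hand for each dataset. One possible alternative, sketched here under the assumption that a linear mapping is acceptable, is to rescale the values into a chosen range of marker areas. The helper scale_sizes and the sample numbers are hypothetical and only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def scale_sizes(values, min_size=1.0, max_size=300.0):
    """Linearly map values onto [min_size, max_size] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # all values equal: use the midpoint size for every marker
        return np.full(values.shape, (min_size + max_size) / 2)
    return min_size + (values - values.min()) / span * (max_size - min_size)

# made-up populations and coordinates, for illustration only
population = np.array([5_000, 50_000, 500_000, 8_000_000])
lng = np.array([-81.0, -84.4, -80.8, -74.0])
lat = np.array([34.0, 33.7, 35.2, 40.7])

sizes = scale_sizes(population)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(lng, lat, s=sizes)
ax.set_title('Marker size scaled to a fixed range');
```

With this approach the smallest value always gets min_size and the largest gets max_size, regardless of the units of the underlying column.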

Marker Color

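The color of the markers can be adjusted using the c parameter. It accepts a single color, a sequence of colors, or an array of numbers with the same length as the number of data points; numbers are mapped to colors through the colormap given by cmap. When c is not specified, each scatter() call takes the next color in the Axes color cycle, which starts with a shade of blue ('C0') under Matplotlib's default settings.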

sc = us_mainland[us_mainland['state_id'] == 'SC']
ga = us_mainland[us_mainland['state_id'] == 'GA']

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot
ax.scatter(sc['lng'], sc['lat'], c='red', label='South Carolina');
ax.scatter(ga['lng'], ga['lat'], c='blue', label='Georgia');

# setting labels and title
ax.set_title('Cities in South Carolina and Georgia');

ax.legend(fontsize=12);

sc = us_mainland[us_mainland['state_id'] == 'SC']
ga = us_mainland[us_mainland['state_id'] == 'GA']

# creating figure and axis
fig, ax = plt.subplots(figsize=(8, 5))

# scatter plot
sc_plt = ax.scatter(sc['lng'], sc['lat'], c=sc['density'], cmap='Reds'); 

# setting labels and title
ax.set_title('Cities in South Carolina, color proportional to Density');

plt.colorbar(sc_plt, label='Density');
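The two examples above use c in different ways: once as a color name and once as a numeric column. As a compact side-by-side sketch on made-up points (the variable names are illustrative), the three common forms of c look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2])
y = np.array([0, 1, 0])

fig, ax = plt.subplots(1, 3, figsize=(10, 3))

# 1) one color for every marker
single = ax[0].scatter(x, y, c='red')

# 2) one named color per marker
per_point = ax[1].scatter(x, y, c=['red', 'green', 'blue'])

# 3) one number per marker, mapped to colors through cmap
values = np.array([10.0, 20.0, 30.0])
mapped = ax[2].scatter(x, y, c=values, cmap='Reds')
fig.colorbar(mapped, ax=ax[2], label='value');
```

A colorbar is only meaningful for the third form, since that is the only one where colors encode numeric values.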

Marker Shape

The shape of the markers can be adjusted using the marker parameter, which accepts a string that specifies the shape. The default value is marker='o' (circle).

Here are some of the marker shapes that you can use:

'o' - Circle
's' - Square
'^' - Triangle
'v' - Inverted Triangle
'x' - X
'+' - Plus
'*' - Star
'D' - Diamond
'd' - Thin Diamond
'p' - Pentagon
'h' - Hexagon
'H' - Rotated Hexagon
'<' - Left Triangle
'>' - Right Triangle

fig, ax = plt.subplots(figsize=(8, 5))

ax.scatter(sc['lng'], sc['lat'], label='South Carolina', color='red',  marker='x');
ax.scatter(ga['lng'], ga['lat'], label='Georgia',        color='blue', marker='^');

ax.legend(fontsize=12);
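Unlike s and c, the marker parameter takes a single shape per scatter() call, which is why the example above calls scatter() once per state. With more categories, the same idea can be written as a loop. The following sketch uses made-up points and a hypothetical category-to-marker mapping.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up points with a category label each, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 3.0, 2.5, 4.0, 3.5])
group = np.array(['a', 'b', 'a', 'c', 'b', 'c'])

# hypothetical mapping from category to marker shape
markers = {'a': 'o', 'b': '^', 'c': 's'}

fig, ax = plt.subplots(figsize=(8, 5))

for name, marker in markers.items():
    mask = group == name  # boolean mask selecting one category
    ax.scatter(x[mask], y[mask], marker=marker, label=name)

ax.legend(fontsize=12);
```

Each call adds its own entry to the legend, so the shapes and labels stay in sync.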

Pandas and Seaborn

Pandas and Seaborn are great libraries for data manipulation and data visualization, respectively. Pandas is used to load and manipulate data, while Seaborn is used to visualize data. Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.

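Creating a scatter plot using Seaborn or Pandas usually requires fewer lines of code than using Matplotlib directly. With Pandas, the plot() method of a DataFrame draws a scatter plot from column names when called with kind='scatter'.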

fig, ax = plt.subplots(figsize=(8, 5))

sc.plot(kind='scatter',
        x='lng',
        y='lat',
        s=sc['population']/1000,
        ax=ax,
        color='red',
        label='South Carolina');

Scatter plots can also be created using the scatterplot() function from the Seaborn library. Given a DataFrame as data, the x and y arguments name the columns to plot on each axis. Optional arguments such as size and hue map further columns to marker size and color.

Note that when using seaborn and pandas, the axis labels are automatically set to the column names of the DataFrame.

import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 5))

sns.scatterplot(data=us_mainland,
                x='lng',
                y='lat',
                size='population',
                sizes=(1, 1000),
                hue='state_id',
                ax=ax,
                legend=False);
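The automatic axis labels are convenient, but raw column names such as lng and lat are not always reader-friendly. As a small sketch on a made-up DataFrame, the labels that Pandas sets can be inspected and then overridden on the same Axes:

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up coordinates, for illustration only
df = pd.DataFrame({'lng': [-81.0, -80.0, -79.9],
                   'lat': [34.0, 32.8, 33.9]})

fig, ax = plt.subplots(figsize=(8, 5))
df.plot(kind='scatter', x='lng', y='lat', ax=ax)

# pandas has already set the labels to the column names
labels_before = (ax.get_xlabel(), ax.get_ylabel())

# override them with more readable text
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude');
```

The same set_xlabel() and set_ylabel() calls work on an Axes drawn by Seaborn, since both libraries draw onto ordinary Matplotlib Axes.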

Please don’t

Overplot

Overplotting occurs when two or more data points are plotted on top of each other. This can make it difficult to see the true distribution of the data. Overplotting can be reduced by adjusting the transparency of the markers using the alpha parameter, which accepts a scalar between 0 (fully transparent) and 1 (fully opaque). By default, markers are fully opaque (alpha=1). The plot below uses the default opaque markers.

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/refs/heads/main/data/loan_approval_dataset.csv')

fig, ax = plt.subplots(figsize=(8, 5))

ax.scatter(data['cibil_score'], data['income_annum']);

ax.set_xlabel('Credit Score');
ax.set_ylabel('Annual Income');
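To see what alpha changes, the following sketch plots the same synthetic point cloud twice, once with the default opacity and once with alpha=0.1. The data is randomly generated for illustration and is not the loan dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic, heavily overlapping data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=5_000)
y = rng.normal(size=5_000)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

opaque = ax[0].scatter(x, y, s=10)            # default: fully opaque
faded = ax[1].scatter(x, y, s=10, alpha=0.1)  # overlapping markers add up

ax[0].set_title('default alpha');
ax[1].set_title('alpha=0.1');
```

With a low alpha, regions where many markers overlap appear darker, so density becomes visible instead of being hidden behind the topmost points.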

Use Scatter for Categorical Data

Scatter plots are used to show the relationship between two numeric variables. If you have categorical data, you should use a different type of plot, such as a bar plot or a box plot.

fig, ax = plt.subplots()

ax.scatter(data['loan_term'], data['loan_amount'], alpha=0.5)

ax.set_ylabel('Loan Amount');
ax.set_xlabel('Loan Term');
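As a sketch of the box plot alternative mentioned above, the following draws one box per category for a numeric variable. The categories and values are synthetic and only illustrate the shape of the code; they are not taken from the loan dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic data, for illustration only: numeric values for three categories
rng = np.random.default_rng(0)
categories = ['short', 'medium', 'long']
values = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (5.0, 7.0, 9.0)]

fig, ax = plt.subplots()

# one box per category, summarizing the distribution of the numeric variable
boxes = ax.boxplot(values)
ax.set_xticklabels(categories)

ax.set_xlabel('Category');
ax.set_ylabel('Value');
```

Each box summarizes the median and spread within a category, which a column of overlapping scatter markers cannot show.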