4.2. Multivariate Visualizations#

Up until now, we’ve discussed how to visualize single-feature distributions. Now, let’s understand how to visualize the relationship between more than one features.

We will continue to use the World Bank dataset, which contains information from 2015/16 about countries around the world. We will use the same features as before: GDP per capita, life expectancy, and population.

import pandas as pd 
import seaborn as sns 
from matplotlib import pyplot as plt

sns.set_style('whitegrid')

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/world_bank.csv', index_col=0)
data.head()
Continent Country Primary completion rate: Male: % of relevant age group: 2015 Primary completion rate: Female: % of relevant age group: 2015 Lower secondary completion rate: Male: % of relevant age group: 2015 Lower secondary completion rate: Female: % of relevant age group: 2015 Youth literacy rate: Male: % of ages 15-24: 2005-14 Youth literacy rate: Female: % of ages 15-24: 2005-14 Adult literacy rate: Male: % ages 15 and older: 2005-14 Adult literacy rate: Female: % ages 15 and older: 2005-14 ... Access to improved sanitation facilities: % of population: 1990 Access to improved sanitation facilities: % of population: 2015 Child immunization rate: Measles: % of children ages 12-23 months: 2015 Child immunization rate: DTP3: % of children ages 12-23 months: 2015 Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 Children sleeping under treated bed nets: % of children under age 5: 2009-2016 Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 Tuberculosis: Treatment success rate: % of new cases: 2014 Tuberculosis: Cases detection rate: % of new estimated cases: 2015
0 Africa Algeria 106.0 105.0 68.0 85.0 96.0 92.0 83.0 68.0 ... 80.0 88.0 95.0 95.0 66.0 42.0 NaN NaN 88.0 80.0
1 Africa Angola NaN NaN NaN NaN 79.0 67.0 82.0 60.0 ... 22.0 52.0 55.0 64.0 NaN NaN 25.9 28.3 34.0 64.0
2 Africa Benin 83.0 73.0 50.0 37.0 55.0 31.0 41.0 18.0 ... 7.0 20.0 75.0 79.0 23.0 33.0 72.7 25.9 89.0 61.0
3 Africa Botswana 98.0 101.0 86.0 87.0 96.0 99.0 87.0 89.0 ... 39.0 63.0 97.0 95.0 NaN NaN NaN NaN 77.0 62.0
5 Africa Burundi 58.0 66.0 35.0 30.0 90.0 88.0 89.0 85.0 ... 42.0 48.0 93.0 94.0 55.0 43.0 53.8 25.4 91.0 51.0

5 rows × 47 columns

4.2.1. Distribution of a numeric feature w.r.t a categorical features#

Let’s start by visualizing the distribution of a numeric feature, across the categories defined by a categorical feature. In other words, we want to visualize the distribution of a numeric feature, separately for each category of another categorical feature.

4.2.1.1. Overlaid Histograms (1 numeric, 1 categorical)#

We can use a histogram to visualize the distribution of a numeric variable. To visualize how this distribution differs between the groups created by another categorical variable, we can create a histogram for each group separately.

In order to create overlaid histograms, we will continue to use sns.histplot. The only addition we need to make is to use the hue argument to specify the categorical feature that defines the groups.

americas = data[data['Continent'].apply(lambda x: "America" in x)]

col = "Gross domestic product: % growth : 2016"

ax = sns.histplot(data = americas, x = col, hue="Continent",  multiple="stack");

ax.set(title="GDP of North American vs. South American countries");
../_images/f96879adc9923369ae137e73b856edd57a3792873f6616d9a44c4f10be170310.png

Note the use of the hue argument to histplot. It adds a new dimension to the plot, by coloring the bars according to the value of the categorical feature.

multiple="stack" is an optional argument and is used to improve the visibility when bars are stacked on top of each other.

These visualizations are arguably the most ubiquitous in science. The canonical version of overlaid histograms are where a new drug is tested against a placebo, and the distribution of some outcome (e.g. blood pressure) is plotted for the placebo group and the drug group.

Most common statistical tests are designed to answer the question: “Do these two groups differ?” This question is answered by comparing the distributions of the two groups.

4.2.1.2. Side-by-side box plots#

col = "Gross domestic product: % growth : 2016"

ax = sns.boxplot(data = data, y = col, x="Continent", width=0.9);

ax.set(title="GDP distribution of countries by continent 2016");
../_images/43b61e7267cfba869d7a3816fcd1490cd0275c230cc31a75d0ca7168763fdd3a.png

4.2.2. Visualizing Relationships#

In addition to visualizing the distribution of features, we often want to understand how two features are related.

4.2.2.1. Scatter Plots (2 or more numeric features)#

Scatter plots are one of the most useful tools in representing the relationship between two numerical features. They are particularly important in gauging the strength (correlation) of the relationship between features. Knowledge of these relationships can then motivate decisions in our modeling process.

In Matplotlib, we use the function plt.scatter to generate a scatter plot. Notice that unlike our examples of plotting single-variable distributions, now we specify sequences of values to be plotted along the x axis and the y axis.

wb = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/world_bank.csv', index_col=0)

ax = sns.scatterplot(data = wb, \
                     x ='Adult literacy rate: Female: % ages 15 and older: 2005-14', \
                     y = "per capita: % growth: 2016")

ax.set(title="Female adult literacy against % growth");
../_images/20dc6d5d1ad95247a0098f1492c5630be8b0e5357e5a9420051703849cf596ab.png

In Seaborn, we call the function sns.scatterplot. We use the x and y parameters to indicate the values to be plotted along the x and y axes, respectively. By using the hue parameter, we can specify a third variable to be used for coloring each scatter point.

sns.scatterplot(data = wb, \
                y = "per capita: % growth: 2016", \
                x = "Adult literacy rate: Female: % ages 15 and older: 2005-14", 
                hue = "Continent")

plt.title("Female adult literacy against % growth");
../_images/523323e32ff8f488f35e264f65263885c810a69e272629668f5daa7c82abaac9.png
ax = sns.scatterplot(data = wb, \
                    y = "per capita: % growth: 2016", \
                    x = "Adult literacy rate: Female: % ages 15 and older: 2005-14", 
                    hue = "Continent", \
                    size="Population: millions: 2016")

ax.figure.set_size_inches(8, 6);

ax.set(title="Female adult literacy against % growth");
../_images/858f73a637472ecebd5e2c0e7bcd8b72c2385e72d0ae4226dcf05b8be44a9491.png

4.2.2.2. Joint Plots (2 or more numeric features)#

sns.jointplot creates a visualization with three components: a scatter plot, a histogram of the distribution of x values, and a histogram of the distribution of y values.

A joint plot visualizes both: relationship and distributions.

sns.jointplot(data = wb, 
              x = "per capita: % growth: 2016", \
              y = "Adult literacy rate: Female: % ages 15 and older: 2005-14")

# plt.suptitle allows us to shift the title up so it does not overlap with the histogram
plt.suptitle("Female adult literacy against % growth")
plt.subplots_adjust(top=0.9);
../_images/57bf618052c12eb4ae226aa9489cb2bda5d0d8b8e4f04098fc0d57a355464278.png

4.2.2.3. Hex plots#

Hex plots can be thought of as a two dimensional histograms that shows the joint distribution between two variables. This is particularly useful working with very dense data. In a hex plot, the x-y plane is binned into hexagons. Hexagons that are darker in color indicate a greater density of data – that is, there are more datapoints that lie in the region enclosed by the hexagon.

We can generate a hex plot using sns.jointplot modified with the kind parameter.

sns.jointplot(data = wb, \
              x = "per capita: % growth: 2016", \
              y = "Adult literacy rate: Female: % ages 15 and older: 2005-14", \
              kind = "hex")

# plt.suptitle allows us to shift the title up so it does not overlap with the histogram
plt.suptitle("Female adult literacy against % growth")
plt.subplots_adjust(top=0.9);
../_images/1d65e95a4d5fd51f4fbdc01001584da45b6c4eda72323393124732d292f861b9.png

4.2.3. Temporal Data: Line Plot#

If you are trying to visualize the relationship between two numeric variables, and one of those variables is time, then you should use a line plot.

Line plots are useful for visualizing the relationship between two numeric variables when one of them is time.

In seaborn, we can create a line plot using the function sns.lineplot. We use the x and y parameters to specify the variable to be plotted along the x and y axes, respectively.

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv')

sns.lineplot(data = data, x = "Year", y = "Popular vote", hue='Result', marker='o');
../_images/310fd7f23152443f3182924cf0bb190d1c8cc9f3852033de4304a575ab9d7661.png

Note that seaborn automatically aggregates the data by taking the mean of each numeric variable at each time point. The shaded region around the line represents the 95% confidence interval for the mean. We’ll talk more about confidence intervals in a later lecture.

4.2.4. Multi-panel Visualizations#

To create a multi-panel visualization, we can use the sns.FacetGrid.

This class takes in a dataframe, the names of the variables that will form the row, column, or hue dimensions of the grid, and the plot type to be produced for each subset of the data. The plot type is provided as a method of the FacetGrid object.

import pandas as pd 
import seaborn as sns 

tips = sns.load_dataset("tips")

g = sns.FacetGrid(tips, col="time",  row="sex");
g.map(sns.scatterplot, "total_bill", "tip");
../_images/b7d56ce77547e55d935edf755390103fa2862dbd15e5036f4cd1bb813de5c50c.png

The variable specification in FacetGrid.map() requires a positional argument mapping, but if the function has a data parameter and accepts named variable assignments, you can also use FacetGrid.map_dataframe():

g = sns.FacetGrid(tips, col="time",  row="sex");
g.map_dataframe(sns.histplot, x="total_bill");
../_images/1b1a7bf44d21e241ccc938e6730daf26c9c0fda092b2bbed3aa30bb52dacfe1c.png

The FacetGrid constructor accepts a hue parameter. Setting this will condition the data on another variable and make multiple plots in different colors. Where possible, label information is tracked so that a single legend can be drawn:

g = sns.FacetGrid(tips, col="time", hue="sex");
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip");
g.add_legend();
../_images/aeb04dd798704809578d0657a9cd1bd52d90f7f8d9295186ce3ed953659f58c5.png

The FacetGrid object has some other useful parameters and methods for tweaking the plot:

g = sns.FacetGrid(tips, col="sex", row="time", margin_titles=True)
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")
g.set_axis_labels("Total bill ($)", "Tip ($)")
g.set_titles(col_template="{col_name} patrons", row_template="{row_name}")
g.set(xlim=(0, 60), ylim=(0, 12), xticks=[10, 30, 50], yticks=[2, 6, 10])
g.tight_layout()
g.savefig("facet_plot.png")
../_images/da56bce9c5a1881ea94df57bafbdd83b88ed7b336bb1ceeb71677d03f36b8989.png