Up until now, we’ve discussed how to visualize single-feature distributions. Now, let’s understand how to visualize the relationship between more than one features.
We will continue to use the World Bank dataset, which contains information from 2015/16 about countries around the world. We will use the same features as before: GDP per capita, life expectancy, and population.
import pandas as pd import seaborn as sns from matplotlib import pyplot as pltsns.set_style('whitegrid')data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/world_bank.csv', index_col=0)data.head()
Continent
Country
Primary completion rate: Male: % of relevant age group: 2015
Primary completion rate: Female: % of relevant age group: 2015
Lower secondary completion rate: Male: % of relevant age group: 2015
Lower secondary completion rate: Female: % of relevant age group: 2015
Youth literacy rate: Male: % of ages 15-24: 2005-14
Youth literacy rate: Female: % of ages 15-24: 2005-14
Adult literacy rate: Male: % ages 15 and older: 2005-14
Adult literacy rate: Female: % ages 15 and older: 2005-14
...
Access to improved sanitation facilities: % of population: 1990
Access to improved sanitation facilities: % of population: 2015
Child immunization rate: Measles: % of children ages 12-23 months: 2015
Child immunization rate: DTP3: % of children ages 12-23 months: 2015
Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016
Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016
Children sleeping under treated bed nets: % of children under age 5: 2009-2016
Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016
Tuberculosis: Treatment success rate: % of new cases: 2014
Tuberculosis: Cases detection rate: % of new estimated cases: 2015
0
Africa
Algeria
106.0
105.0
68.0
85.0
96.0
92.0
83.0
68.0
...
80.0
88.0
95.0
95.0
66.0
42.0
NaN
NaN
88.0
80.0
1
Africa
Angola
NaN
NaN
NaN
NaN
79.0
67.0
82.0
60.0
...
22.0
52.0
55.0
64.0
NaN
NaN
25.9
28.3
34.0
64.0
2
Africa
Benin
83.0
73.0
50.0
37.0
55.0
31.0
41.0
18.0
...
7.0
20.0
75.0
79.0
23.0
33.0
72.7
25.9
89.0
61.0
3
Africa
Botswana
98.0
101.0
86.0
87.0
96.0
99.0
87.0
89.0
...
39.0
63.0
97.0
95.0
NaN
NaN
NaN
NaN
77.0
62.0
5
Africa
Burundi
58.0
66.0
35.0
30.0
90.0
88.0
89.0
85.0
...
42.0
48.0
93.0
94.0
55.0
43.0
53.8
25.4
91.0
51.0
5 rows × 47 columns
Distribution of a numeric feature w.r.t a categorical features
Let’s start by visualizing the distribution of a numeric feature, across the categories defined by a categorical feature. In other words, we want to visualize the distribution of a numeric feature, separately for each category of another categorical feature.
Overlaid Histograms (1 numeric, 1 categorical)
We can use a histogram to visualize the distribution of a numeric variable. To visualize how this distribution differs between the groups created by another categorical variable, we can create a histogram for each group separately.
In order to create overlaid histograms, we will continue to use sns.histplot. The only addition we need to make is to use the hue argument to specify the categorical feature that defines the groups.
americas = data[data['Continent'].apply(lambda x: "America"in x)]col ="Gross domestic product: % growth : 2016"ax = sns.histplot(data = americas, x = col, hue="Continent", multiple="stack");ax.set(title="GDP of North American vs. South American countries");
Note the use of the hue argument to histplot. It adds a new dimension to the plot, by coloring the bars according to the value of the categorical feature.
multiple="stack" is an optional argument and is used to improve the visibility when bars are stacked on top of each other.
These visualizations are arguably the most ubiquitous in science. The canonical version of overlaid histograms are where a new drug is tested against a placebo, and the distribution of some outcome (e.g. blood pressure) is plotted for the placebo group and the drug group.
Most common statistical tests are designed to answer the question: “Do these two groups differ?” This question is answered by comparing the distributions of the two groups.
Side-by-side box plots
col ="Gross domestic product: % growth : 2016"ax = sns.boxplot(data = data, y = col, x="Continent", width=0.9);ax.set(title="GDP distribution of countries by continent 2016");
Visualizing Relationships
In addition to visualizing the distribution of features, we often want to understand how two features are related.
Scatter Plots (2 or more numeric features)
Scatter plots are one of the most useful tools in representing the relationship between two numerical features. They are particularly important in gauging the strength (correlation) of the relationship between features. Knowledge of these relationships can then motivate decisions in our modeling process.
In Matplotlib, we use the function plt.scatter to generate a scatter plot. Notice that unlike our examples of plotting single-variable distributions, now we specify sequences of values to be plotted along the x axis and the y axis.
wb = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/world_bank.csv', index_col=0)ax = sns.scatterplot(data = wb, \ x ='Adult literacy rate: Female: % ages 15 and older: 2005-14', \ y ="per capita: % growth: 2016")ax.set(title="Female adult literacy against % growth");
In Seaborn, we call the function sns.scatterplot. We use the x and y parameters to indicate the values to be plotted along the x and y axes, respectively. By using the hue parameter, we can specify a third variable to be used for coloring each scatter point.
sns.scatterplot(data = wb, \ y ="per capita: % growth: 2016", \ x ="Adult literacy rate: Female: % ages 15 and older: 2005-14", hue ="Continent")plt.title("Female adult literacy against % growth");
ax = sns.scatterplot(data = wb, \ y ="per capita: % growth: 2016", \ x ="Adult literacy rate: Female: % ages 15 and older: 2005-14", hue ="Continent", \ size="Population: millions: 2016")ax.figure.set_size_inches(8, 6);ax.set(title="Female adult literacy against % growth");
Joint Plots (2 or more numeric features)
sns.jointplot creates a visualization with three components: a scatter plot, a histogram of the distribution of x values, and a histogram of the distribution of y values.
A joint plot visualizes both: relationship and distributions.
sns.jointplot(data = wb, x ="per capita: % growth: 2016", \ y ="Adult literacy rate: Female: % ages 15 and older: 2005-14")# plt.suptitle allows us to shift the title up so it does not overlap with the histogramplt.suptitle("Female adult literacy against % growth")plt.subplots_adjust(top=0.9);
Hex plots
Hex plots can be thought of as a two dimensional histograms that shows the joint distribution between two variables. This is particularly useful working with very dense data. In a hex plot, the x-y plane is binned into hexagons. Hexagons that are darker in color indicate a greater density of data – that is, there are more datapoints that lie in the region enclosed by the hexagon.
We can generate a hex plot using sns.jointplot modified with the kind parameter.
sns.jointplot(data = wb, \ x ="per capita: % growth: 2016", \ y ="Adult literacy rate: Female: % ages 15 and older: 2005-14", \ kind ="hex")# plt.suptitle allows us to shift the title up so it does not overlap with the histogramplt.suptitle("Female adult literacy against % growth")plt.subplots_adjust(top=0.9);
Temporal Data: Line Plot
If you are trying to visualize the relationship between two numeric variables, and one of those variables is time, then you should use a line plot.
Line plots are useful for visualizing the relationship between two numeric variables when one of them is time.
In seaborn, we can create a line plot using the function sns.lineplot. We use the x and y parameters to specify the variable to be plotted along the x and y axes, respectively.
data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv')sns.lineplot(data = data, x ="Year", y ="Popular vote", hue='Result', marker='o');
Note that seaborn automatically aggregates the data by taking the mean of each numeric variable at each time point. The shaded region around the line represents the 95% confidence interval for the mean. We’ll talk more about confidence intervals in a later lecture.
Multi-panel Visualizations
To create a multi-panel visualization, we can use the sns.FacetGrid.
This class takes in a dataframe, the names of the variables that will form the row, column, or hue dimensions of the grid, and the plot type to be produced for each subset of the data. The plot type is provided as a method of the FacetGrid object.
import pandas as pd import seaborn as sns tips = sns.load_dataset("tips")g = sns.FacetGrid(tips, col="time", row="sex");g.map(sns.scatterplot, "total_bill", "tip");
The variable specification in FacetGrid.map() requires a positional argument mapping, but if the function has a data parameter and accepts named variable assignments, you can also use FacetGrid.map_dataframe():
g = sns.FacetGrid(tips, col="time", row="sex");g.map_dataframe(sns.histplot, x="total_bill");
The FacetGrid constructor accepts a hue parameter. Setting this will condition the data on another variable and make multiple plots in different colors. Where possible, label information is tracked so that a single legend can be drawn:
g = sns.FacetGrid(tips, col="time", hue="sex");g.map_dataframe(sns.scatterplot, x="total_bill", y="tip");g.add_legend();
The FacetGrid object has some other useful parameters and methods for tweaking the plot: