Visualizing Data

4. Visualizing Data

Visualizing data is a key part of data science. It is not only a way to communicate your findings to others but, more importantly, it is a way to understand your data, models and algorithms better.

In this section, we will learn how to use the libraries matplotlib and seaborn to create visualizations.

matplotlib is the primary plotting library in Python. It is a powerful and highly customizable library that can be used to create a wide variety of plots and graphs. However, despite its power, it is often not very user-friendly and can require a lot of code to create even simple plots.

seaborn is a data visualization library built on top of matplotlib that is easier to use and creates more visually appealing plots. I will try to use seaborn whenever possible, but may have to fall back occasionally to matplotlib for formatting details and customization.

Just as pandas is conventionally imported as pd, matplotlib.pyplot is conventionally imported as plt and seaborn is conventionally imported as sns.

Let’s start by importing the libraries we will need.

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

There has long been an impression amongst academics and practitioners that “numerical calculations are exact, but graphs are rough”. In 1973, Francis Anscombe set out to counter this common misconception by creating a set of four datasets that are today known as Anscombe’s quartet.

The code cell below downloads this data set and loads it as a pandas DataFrame:

anscombe = sns.load_dataset("anscombe")
anscombe.head()

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33

Now let’s look at the summary statistics of the x and y columns, grouped by the dataset column:

anscombe.groupby("dataset").describe()

	x								y
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
dataset
I	11.0	9.0	3.316625	4.0	6.5	9.0	11.5	14.0	11.0	7.500909	2.031568	4.26	6.315	7.58	8.57	10.84
II	11.0	9.0	3.316625	4.0	6.5	9.0	11.5	14.0	11.0	7.500909	2.031657	3.10	6.695	8.14	8.95	9.26
III	11.0	9.0	3.316625	4.0	6.5	9.0	11.5	14.0	11.0	7.500000	2.030424	5.39	6.250	7.11	7.98	12.74
IV	11.0	9.0	3.316625	8.0	8.0	8.0	8.0	19.0	11.0	7.500909	2.030579	5.25	6.170	7.04	8.19	12.50

Note that each of the four datasets contains eleven (x, y) pairs, as the count columns show.

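Across the four datasets, the means and standard deviations of x and y are nearly identical: x has a mean of 9.0 and a standard deviation of about 3.32, while y has a mean of about 7.50 and a standard deviation of about 2.03.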
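The similarity is usually said to extend beyond means and standard deviations to the correlation between x and y and to the fitted regression line. The following is a minimal sketch of how one might check this per group. The helper name `correlation_and_fit` and the toy DataFrame are illustrative assumptions, not part of pandas or seaborn; the slope is computed as cov(x, y) / var(x), the ordinary least-squares formula for a single predictor.

```python
import pandas as pd


def correlation_and_fit(df, group="dataset", x="x", y="y"):
    """Per-group Pearson correlation and least-squares line y = intercept + slope * x.

    Hypothetical helper for illustration only.
    """
    rows = {}
    for name, g in df.groupby(group):
        # OLS slope for one predictor: sample covariance over sample variance
        slope = g[x].cov(g[y]) / g[x].var()
        # The fitted line passes through the point of means
        intercept = g[y].mean() - slope * g[x].mean()
        rows[name] = {
            "corr": g[x].corr(g[y]),
            "slope": slope,
            "intercept": intercept,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up toy data (not the quartet), only to show the call:
# group A follows y = 1 + 2x, group B follows y = 11 - 2x.
toy = pd.DataFrame({
    "dataset": ["A"] * 4 + ["B"] * 4,
    "x": [1, 2, 3, 4] * 2,
    "y": [3, 5, 7, 9, 9, 7, 5, 3],
})
print(correlation_and_fit(toy))
```

Applied to the anscombe DataFrame loaded earlier, `correlation_and_fit(anscombe)` should report a correlation of roughly 0.816 and a fitted line close to y = 3 + 0.5x for all four datasets.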

Now let’s create a scatter plot of the data using seaborn:

g = sns.FacetGrid(anscombe, col="dataset");
g.map(sns.scatterplot, "x", "y", s=100, color="orange", linewidth=.5, edgecolor="black");

[Figure: scatter plots of y against x, one panel for each of the four datasets in Anscombe’s quartet]

Anscombe’s quartet demonstrates both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.
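To make the influence of individual points easier to see, one option is to overlay a least-squares line on each panel. The sketch below uses seaborn's `lmplot` for this. The wrapper name `plot_with_fit` is a hypothetical helper introduced here for illustration, and the styling values are arbitrary choices.

```python
import pandas as pd
import seaborn as sns


def plot_with_fit(df, group="dataset", x="x", y="y"):
    """Draw one scatter panel per group with a linear regression line overlaid.

    Hypothetical helper for illustration only; returns the seaborn FacetGrid.
    """
    return sns.lmplot(
        data=df,
        x=x,
        y=y,
        col=group,               # one panel per value of the grouping column
        ci=None,                 # draw only the fitted line, no confidence band
        height=3,
        scatter_kws={"s": 60},   # marker size, passed through to the scatter call
    )
```

Calling `plot_with_fit(anscombe)` on the DataFrame loaded earlier draws the four panels. Because the fitted lines should be nearly the same in every panel, the differences in how the points are arranged around them stand out.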

Data visualization is arguably the most mistake-prone part of the data science process. It is very easy to create misleading visualizations that lead to incorrect conclusions, so it is important to be aware of the common pitfalls and to avoid them.

The following is a useful taxonomy for choosing the right visualization depending on your goals for your data: