{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing Data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Visualizing data is a key part of data science. It is not only a way to communicate your findings to others but, more importantly, **it is a way to understand your data**, models and algorithms better. \n", "\n", "In this section, we will learn how to use the libraries `matplotlib` and `seaborn` to create visualizations. \n", "\n", "`matplotlib` is the primary plotting library in Python. It is a very powerful and highly customizable library that can be used to create a wide variety of plots and graphs. However, despite its power, it is often not very user-friendly, requiring a lot of code to create even simple plots. \n", "\n", "`seaborn` is a data visualization library built on top of `matplotlib` that is easier to use and creates more visually appealing plots. I will try to use `seaborn` whenever possible, but may occasionally have to fall back on `matplotlib` for formatting details and customization.\n", "\n", "Just as `pandas` is conventionally imported as `pd`, `matplotlib.pyplot` is conventionally imported as `plt` and `seaborn` is conventionally imported as `sns`.\n", "\n", "Let's start by importing the libraries we will need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "There has long been an impression amongst academics and practitioners that _\"numerical calculations are exact, but graphs are rough\"_. In 1973, Francis Anscombe set out to counter this common misconception by creating a set of four datasets that are today known as **Anscombe's quartet**.\n", "\n", "The code cell below downloads this dataset and loads it as a `pandas` `DataFrame`:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
\n", "
" ], "text/plain": [ " dataset x y\n", "0 I 10.0 8.04\n", "1 I 8.0 6.95\n", "2 I 13.0 7.58\n", "3 I 9.0 8.81\n", "4 I 11.0 8.33" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anscombe = sns.load_dataset(\"anscombe\")\n", "anscombe.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what the summary statistics of `x` and `y` features look like, with respect to `dataset` feature:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x y
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
dataset
I 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 2.031568 4.26 6.315 7.58 8.57 10.84
II 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 2.031657 3.10 6.695 8.14 8.95 9.26
III 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500000 2.030424 5.39 6.250 7.11 7.98 12.74
IV 11.0 9.0 3.316625 8.0 8.0 8.0 8.0 19.0 11.0 7.500909 2.030579 5.25 6.170 7.04 8.19 12.50
\n", "
" ], "text/plain": [ " x y \\\n", " count mean std min 25% 50% 75% max count mean \n", "dataset \n", "I 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 \n", "II 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500909 \n", "III 11.0 9.0 3.316625 4.0 6.5 9.0 11.5 14.0 11.0 7.500000 \n", "IV 11.0 9.0 3.316625 8.0 8.0 8.0 8.0 19.0 11.0 7.500909 \n", "\n", " \n", " std min 25% 50% 75% max \n", "dataset \n", "I 2.031568 4.26 6.315 7.58 8.57 10.84 \n", "II 2.031657 3.10 6.695 8.14 8.95 9.26 \n", "III 2.030424 5.39 6.250 7.11 7.98 12.74 \n", "IV 2.030579 5.25 6.170 7.04 8.19 12.50 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anscombe.groupby(\"dataset\").describe()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that for all four unique values of `dataset`, we have eleven (`x`, `y`) values, as seen in `count`. \n", "\n", "For each value of `dataset`, `x` and `y` have nearly **identical simple descriptive statistics**.\n", "\n", "Now let's create a scatter plot of the data using `seaborn`:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "g = sns.FacetGrid(anscombe, col=\"dataset\");\n", "g.map(sns.scatterplot, \"x\", \"y\", s=100, color=\"orange\", linewidth=.5, edgecolor=\"black\");" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Anscombe's quartet demonstrates both the **importance of graphing data before analyzing** it and the **effect of outliers** and other influential observations on statistical properties." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Data visualization is arguably the most mistake-prone part of the data science process. It is very easy to create misleading visualizations that lead to incorrect conclusions, so it is important to know the common pitfalls and avoid them. \n", "\n", "The following is a useful taxonomy for choosing the right visualization depending on your goals for your data:\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }