{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Common Data Formats \n",
    "\n",
    "Format is a term used to describe the way data is stored. For example, a single image is stored as a 2D array of pixels. A video is stored as a sequence of images. A sound is stored as a 1D array of samples. A text is stored as a sequence of characters.\n",
    "\n",
    "### Flat Formats\n",
    "\n",
    "Flat data formats are native to `pandas` and are the simplest and the most ubiquitous file formats in general. \n",
    "\n",
    "To avoid data redundancy, data is often factored into multiple tables. For example, in a database of a school, there may be a table for students, a table for teachers, a table for classes, a table for grades, etc. Depending on the question of interest, these tables are then joined together to form a single table.\n",
    "\n",
    "#### Comma Separated Values (CSV)\n",
    "\n",
    "CSV is an open format used to store tabular data. It is a text file where **each line is a row of data**. In other words, **each line is separated by newline character `\\n`**. Within a row, **each column is separated by a comma `,`**. The first row is optionally the header row containing the names of the columns. \n",
    "\n",
    "Example of a CSV file is the [elections.csv](https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv) file. \n",
    "\n",
    "In pandas you can **read a CSV** file as a pandas DataFrame using the `pd.read_csv()` function. \n",
    "\n",
    "The first input to `pd.read_csv()` is the filename or filepath that you want to read. The other three most important parameters of the `pd.read_csv()` function are: \n",
    "1. `sep`: (default: `sep=','`) specifies the separator used to separate columns. Default is `,` which means the columns are separated by a comma.\n",
    "\n",
    "2. `header`:  (default: `header=0`) specifies the row number to be used as the header. Default is `0` which means the first row is assumed to contain column names. If the file does not contain a header row, then `header=None` should be used.\n",
    "\n",
    "    * `names`: (default: `names=None`) specifies the column names as a list of strings. Default is `None` which means the column names are read from the header row.\n",
    "\n",
    "\n",
    "3. `index_col`: (default: `index_col=None`) specifies the column number or column name to be used as the index. Default is `None` which means the index is automatically generated as integers starting from 0. If the file does not contain a header row, then `index_col=0` should be used.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Candidate</th>\n",
       "      <th>Party</th>\n",
       "      <th>Popular vote</th>\n",
       "      <th>Result</th>\n",
       "      <th>%</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1824</td>\n",
       "      <td>Andrew Jackson</td>\n",
       "      <td>Democratic-Republican</td>\n",
       "      <td>151271</td>\n",
       "      <td>loss</td>\n",
       "      <td>57.210122</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1824</td>\n",
       "      <td>John Quincy Adams</td>\n",
       "      <td>Democratic-Republican</td>\n",
       "      <td>113142</td>\n",
       "      <td>win</td>\n",
       "      <td>42.789878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1828</td>\n",
       "      <td>Andrew Jackson</td>\n",
       "      <td>Democratic</td>\n",
       "      <td>642806</td>\n",
       "      <td>win</td>\n",
       "      <td>56.203927</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1828</td>\n",
       "      <td>John Quincy Adams</td>\n",
       "      <td>National Republican</td>\n",
       "      <td>500897</td>\n",
       "      <td>loss</td>\n",
       "      <td>43.796073</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1832</td>\n",
       "      <td>Andrew Jackson</td>\n",
       "      <td>Democratic</td>\n",
       "      <td>702735</td>\n",
       "      <td>win</td>\n",
       "      <td>54.574789</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Year          Candidate                  Party  Popular vote Result  \\\n",
       "0  1824     Andrew Jackson  Democratic-Republican        151271   loss   \n",
       "1  1824  John Quincy Adams  Democratic-Republican        113142    win   \n",
       "2  1828     Andrew Jackson             Democratic        642806    win   \n",
       "3  1828  John Quincy Adams    National Republican        500897   loss   \n",
       "4  1832     Andrew Jackson             Democratic        702735    win   \n",
       "\n",
       "           %  \n",
       "0  57.210122  \n",
       "1  42.789878  \n",
       "2  56.203927  \n",
       "3  43.796073  \n",
       "4  54.574789  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "\n",
    "url  = 'https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv'\n",
    "data = pd.read_csv(url)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To write a DataFrame to a CSV file, use the `df.to_csv()` function. The first input to `df.to_csv()` is the filename or filepath that you want to write to. Other important parameters of the `df.to_csv()` function are:\n",
    "\n",
    "1. `sep`: (default: `sep=','`) specifies the separator used to separate columns. Default is `,` which means the columns are separated by a comma.\n",
    "\n",
    "2. `header`:  (default: `header=True`) specifies whether to write the header row. Default is `True` which means the header row is written. If you don't want to write the header row, then `header=False` should be used.\n",
    "\n",
    "3. `index`: (default: `index=True`) specifies whether to write the index column. Default is `True` which means the index column is written. If you don't want to write the index column, then `index=False` should be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.to_csv('elections.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The CSV format has many variations, each different in regard of the separator used. For instance, a particularly popular variant called **TSV (Tab Separated Values)** uses tab character `\\t` to separate columns instead of a comma. In other variants the column separator can also be a semicolon `;` or a pipe `|`. \n",
    "\n",
    "Example of a TSV file is [restaurants.tsv](https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants.tsv) file. \n",
    "\n",
    "Note the use of **`sep='\\t'`** parameter in `pd.read_csv()` in the code below: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>business_name</th>\n",
       "      <th>business_address</th>\n",
       "      <th>business_city</th>\n",
       "      <th>business_state</th>\n",
       "      <th>business_postal_code</th>\n",
       "      <th>business_latitude</th>\n",
       "      <th>business_longitude</th>\n",
       "      <th>business_location</th>\n",
       "      <th>business_phone_number</th>\n",
       "      <th>...</th>\n",
       "      <th>inspection_score</th>\n",
       "      <th>inspection_type</th>\n",
       "      <th>violation_id</th>\n",
       "      <th>violation_description</th>\n",
       "      <th>risk_category</th>\n",
       "      <th>Neighborhoods</th>\n",
       "      <th>SF Find Neighborhoods</th>\n",
       "      <th>Current Police Districts</th>\n",
       "      <th>Current Supervisor Districts</th>\n",
       "      <th>Analysis Neighborhoods</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>835</td>\n",
       "      <td>Kam Po Kitchen</td>\n",
       "      <td>801 Broadway St</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>CA</td>\n",
       "      <td>94133</td>\n",
       "      <td>37.797223</td>\n",
       "      <td>-122.410513</td>\n",
       "      <td>POINT (-122.410513 37.797223)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>88.0</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>835_20180917_103139</td>\n",
       "      <td>Improper food storage</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>107.0</td>\n",
       "      <td>107.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>905</td>\n",
       "      <td>Working Girls' Cafe'</td>\n",
       "      <td>0259 Kearny St</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>CA</td>\n",
       "      <td>94108</td>\n",
       "      <td>37.790477</td>\n",
       "      <td>-122.404033</td>\n",
       "      <td>POINT (-122.404033 37.790477)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>87.0</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>905_20190415_103114</td>\n",
       "      <td>High risk vermin infestation</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>19.0</td>\n",
       "      <td>19.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1203</td>\n",
       "      <td>TAWAN'S THAI FOOD</td>\n",
       "      <td>4403 GEARY Blvd</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>CA</td>\n",
       "      <td>94118</td>\n",
       "      <td>37.780834</td>\n",
       "      <td>-122.466590</td>\n",
       "      <td>POINT (-122.46659 37.780834)</td>\n",
       "      <td>1.415576e+10</td>\n",
       "      <td>...</td>\n",
       "      <td>77.0</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>1203_20170803_103120</td>\n",
       "      <td>Moderate risk food holding temperature</td>\n",
       "      <td>Moderate Risk</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1345</td>\n",
       "      <td>Cordon Bleu</td>\n",
       "      <td>1574 California St</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>CA</td>\n",
       "      <td>94109</td>\n",
       "      <td>37.790683</td>\n",
       "      <td>-122.420264</td>\n",
       "      <td>POINT (-122.420264 37.790683)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>81.0</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>1345_20170928_103105</td>\n",
       "      <td>Improper cooling methods</td>\n",
       "      <td>High Risk</td>\n",
       "      <td>105.0</td>\n",
       "      <td>105.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>21.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1352</td>\n",
       "      <td>LA TORTILLA</td>\n",
       "      <td>495 Castro St B</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>CA</td>\n",
       "      <td>94114</td>\n",
       "      <td>37.760954</td>\n",
       "      <td>-122.434935</td>\n",
       "      <td>POINT (-122.434935 37.760954)</td>\n",
       "      <td>1.415586e+10</td>\n",
       "      <td>...</td>\n",
       "      <td>74.0</td>\n",
       "      <td>Routine - Unscheduled</td>\n",
       "      <td>1352_20180620_103177</td>\n",
       "      <td>Non service animal</td>\n",
       "      <td>Low Risk</td>\n",
       "      <td>38.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   business_id         business_name    business_address  business_city  \\\n",
       "0          835        Kam Po Kitchen     801 Broadway St  San Francisco   \n",
       "1          905  Working Girls' Cafe'      0259 Kearny St  San Francisco   \n",
       "2         1203     TAWAN'S THAI FOOD     4403 GEARY Blvd  San Francisco   \n",
       "3         1345           Cordon Bleu  1574 California St  San Francisco   \n",
       "4         1352           LA TORTILLA     495 Castro St B  San Francisco   \n",
       "\n",
       "  business_state business_postal_code  business_latitude  business_longitude  \\\n",
       "0             CA                94133          37.797223         -122.410513   \n",
       "1             CA                94108          37.790477         -122.404033   \n",
       "2             CA                94118          37.780834         -122.466590   \n",
       "3             CA                94109          37.790683         -122.420264   \n",
       "4             CA                94114          37.760954         -122.434935   \n",
       "\n",
       "               business_location  business_phone_number  ... inspection_score  \\\n",
       "0  POINT (-122.410513 37.797223)                    NaN  ...             88.0   \n",
       "1  POINT (-122.404033 37.790477)                    NaN  ...             87.0   \n",
       "2   POINT (-122.46659 37.780834)           1.415576e+10  ...             77.0   \n",
       "3  POINT (-122.420264 37.790683)                    NaN  ...             81.0   \n",
       "4  POINT (-122.434935 37.760954)           1.415586e+10  ...             74.0   \n",
       "\n",
       "         inspection_type          violation_id  \\\n",
       "0  Routine - Unscheduled   835_20180917_103139   \n",
       "1  Routine - Unscheduled   905_20190415_103114   \n",
       "2  Routine - Unscheduled  1203_20170803_103120   \n",
       "3  Routine - Unscheduled  1345_20170928_103105   \n",
       "4  Routine - Unscheduled  1352_20180620_103177   \n",
       "\n",
       "                    violation_description  risk_category Neighborhoods  \\\n",
       "0                   Improper food storage       Low Risk         107.0   \n",
       "1            High risk vermin infestation      High Risk          19.0   \n",
       "2  Moderate risk food holding temperature  Moderate Risk           5.0   \n",
       "3                Improper cooling methods      High Risk         105.0   \n",
       "4                      Non service animal       Low Risk          38.0   \n",
       "\n",
       "  SF Find Neighborhoods  Current Police Districts  \\\n",
       "0                 107.0                       6.0   \n",
       "1                  19.0                       6.0   \n",
       "2                   5.0                       8.0   \n",
       "3                 105.0                       4.0   \n",
       "4                  38.0                       3.0   \n",
       "\n",
       "   Current Supervisor Districts  Analysis Neighborhoods  \n",
       "0                           3.0                     6.0  \n",
       "1                           3.0                     8.0  \n",
       "2                           4.0                    11.0  \n",
       "3                           3.0                    21.0  \n",
       "4                           5.0                     5.0  \n",
       "\n",
       "[5 rows x 22 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "url  = \"https://raw.githubusercontent.com/fahadsultan/csc272/main/data/restaurants.tsv\"\n",
    "data = pd.read_csv(url, sep='\\t', index_col=0)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "#### Excel Spreadsheet (XLSX)\n",
    "\n",
    "XLSX is a proprietary format also used to store tabular data. Unlike a CSV, an XLSX is not a plain text file that you can simply read in any text editor. In contrast, an XLSX is a binary file which can be read only in specific software such as Microsft Excel or OpenOffice Calc. An excel spreedsheet does a lot more than just store(tabular) data such as storing formulas, charts and images, etc.\n",
    "\n",
    "However, for our purposes here, the only distinction between a CSV file and an XLSX file is that a) an XLSX file can contain multiple sheets where each sheet is a table and b) you can read an XLSX file in pandas using the **`pd.read_excel()`** function.\n",
    "\n",
    "`pd.read_excel` is very similar to `pd.read_csv` where the first input is the filename or filepath that you want to read. Other inputs such as `header`, `names`, `index_col` are the same as `pd.read_csv`. The only additional input is:\n",
    "\n",
    "* `sheet_name`: (default: `sheet_name=0`) specifies the sheet number or sheet name to be read. Default is `0` which means the first sheet is read. If the file contains multiple sheets, then `sheet_name=None` should be used.\n",
    "\n",
    "You can download a sample excel spreadsheet from [here](../data/Financial%20Sample.xlsx)\n",
    "\n",
    "Note that the file contains two sheets: _\"All Data\"_ and _\"Just US\"_\n",
    "\n",
    "The line of code below reads in just the sheet labeled _\"Just US\"_ from the spreadsheet, using the `sheet_name` parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_excel?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Segment</th>\n",
       "      <th>Country</th>\n",
       "      <th>Product</th>\n",
       "      <th>Discount Band</th>\n",
       "      <th>Units Sold</th>\n",
       "      <th>Manufacturing Price</th>\n",
       "      <th>Sale Price</th>\n",
       "      <th>Gross Sales</th>\n",
       "      <th>Discounts</th>\n",
       "      <th>Sales</th>\n",
       "      <th>COGS</th>\n",
       "      <th>Profit</th>\n",
       "      <th>Date</th>\n",
       "      <th>Month Number</th>\n",
       "      <th>Month Name</th>\n",
       "      <th>Year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Midmarket</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>Montana</td>\n",
       "      <td>None</td>\n",
       "      <td>615.0</td>\n",
       "      <td>5</td>\n",
       "      <td>15</td>\n",
       "      <td>9225.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>9225.0</td>\n",
       "      <td>6150.0</td>\n",
       "      <td>3075.0</td>\n",
       "      <td>41974</td>\n",
       "      <td>12</td>\n",
       "      <td>December</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Government</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>Paseo</td>\n",
       "      <td>None</td>\n",
       "      <td>1143.0</td>\n",
       "      <td>10</td>\n",
       "      <td>7</td>\n",
       "      <td>8001.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8001.0</td>\n",
       "      <td>5715.0</td>\n",
       "      <td>2286.0</td>\n",
       "      <td>41913</td>\n",
       "      <td>10</td>\n",
       "      <td>October</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Channel Partners</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>Paseo</td>\n",
       "      <td>None</td>\n",
       "      <td>912.0</td>\n",
       "      <td>10</td>\n",
       "      <td>12</td>\n",
       "      <td>10944.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10944.0</td>\n",
       "      <td>2736.0</td>\n",
       "      <td>8208.0</td>\n",
       "      <td>41579</td>\n",
       "      <td>11</td>\n",
       "      <td>November</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Enterprise</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>Velo</td>\n",
       "      <td>None</td>\n",
       "      <td>2821.0</td>\n",
       "      <td>120</td>\n",
       "      <td>125</td>\n",
       "      <td>352625.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>352625.0</td>\n",
       "      <td>338520.0</td>\n",
       "      <td>14105.0</td>\n",
       "      <td>41852</td>\n",
       "      <td>8</td>\n",
       "      <td>August</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Channel Partners</td>\n",
       "      <td>United States of America</td>\n",
       "      <td>Amarilla</td>\n",
       "      <td>None</td>\n",
       "      <td>1953.0</td>\n",
       "      <td>260</td>\n",
       "      <td>12</td>\n",
       "      <td>23436.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>23436.0</td>\n",
       "      <td>5859.0</td>\n",
       "      <td>17577.0</td>\n",
       "      <td>41730</td>\n",
       "      <td>4</td>\n",
       "      <td>April</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            Segment                   Country   Product Discount Band  \\\n",
       "0         Midmarket  United States of America   Montana          None   \n",
       "1        Government  United States of America     Paseo          None   \n",
       "2  Channel Partners  United States of America     Paseo          None   \n",
       "3        Enterprise  United States of America      Velo          None   \n",
       "4  Channel Partners  United States of America  Amarilla          None   \n",
       "\n",
       "   Units Sold  Manufacturing Price  Sale Price  Gross Sales  Discounts  \\\n",
       "0       615.0                    5          15       9225.0        0.0   \n",
       "1      1143.0                   10           7       8001.0        0.0   \n",
       "2       912.0                   10          12      10944.0        0.0   \n",
       "3      2821.0                  120         125     352625.0        0.0   \n",
       "4      1953.0                  260          12      23436.0        0.0   \n",
       "\n",
       "      Sales      COGS   Profit   Date  Month Number Month Name  Year  \n",
       "0    9225.0    6150.0   3075.0  41974            12   December  2014  \n",
       "1    8001.0    5715.0   2286.0  41913            10    October  2014  \n",
       "2   10944.0    2736.0   8208.0  41579            11   November  2013  \n",
       "3  352625.0  338520.0  14105.0  41852             8     August  2014  \n",
       "4   23436.0    5859.0  17577.0  41730             4      April  2014  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "data = pd.read_excel('../data/Financial Sample.xlsx', sheet_name='US Only')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just as you can write any pandas DataFrame to CSV file using `df.to_csv()`, you can write any DataFrame to XLSX file using the `df.to_excel()` function. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "### Hierarchical (Nested) Formats\n",
    "\n",
    "Hierarchical data formats are used to store data that is inherently hierarchical. These formats are particularly popular on the internet for exchange of information with web services using APIs (Application Programming Interfaces).\n",
    "\n",
    "#### Extensible Markup Language (XML)\n",
    "\n",
    "XML is a format used to store hierarchical data. Data in XML is stored as a tree structure. This tree is constituent of **nodes**. Each node has a start tag and an end tag. The start tag is enclosed in angle brackets `<` and `>` e.g. `<name>`. The end tag is also enclosed in angle brackets but it also has a forward slash `/` after the opening angle bracket e.g. `</name>`.\n",
    "\n",
    "The start tag and the end tag  together are called an **element**. \n",
    "\n",
    "The start tag can optionally contain **attributes**. Attributes are name-value pairs. The value is enclosed in double quotes `\"`. The start tag can optionally contain **child elements**. Child elements are enclosed between the start tag and the end tag. The end tag can optionally contain **text**. Text is the value of the node. The text is enclosed between the start tag and the end tag. \n",
    "\n",
    "The first node in an XML file is called the **root node**.\n",
    "\n",
    "Example of an XML file is \n",
    "\n",
    "``` xml\n",
    "<note>\n",
    "<to>Tove</to>\n",
    "<from>Jani</from>\n",
    "<heading>Reminder</heading>\n",
    "<body>Don't forget me this weekend!</body>\n",
    "</note>\n",
    "```\n",
    "\n",
    "``` xml\n",
    "<breakfast_menu>\n",
    "    <food>\n",
    "        <name>Belgian Waffles</name>\n",
    "        <price>$5.95</price>\n",
    "        <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>\n",
    "        <calories>650</calories>\n",
    "    </food>\n",
    "    <food>\n",
    "        <name>Strawberry Belgian Waffles</name>\n",
    "        <price>$7.95</price>\n",
    "        <description>Light Belgian waffles covered with strawberries and whipped cream</description>\n",
    "        <calories>900</calories>\n",
    "    </food>\n",
    "    <food>\n",
    "        <name>Berry-Berry Belgian Waffles</name>\n",
    "        <price>$8.95</price>\n",
    "        <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>\n",
    "        <calories>900</calories>\n",
    "    </food>\n",
    "    <food>\n",
    "        <name>French Toast</name>\n",
    "        <price>$4.50</price>\n",
    "        <description>Thick slices made from our homemade sourdough bread</description>\n",
    "        <calories>600</calories>\n",
    "    </food>\n",
    "    <food>\n",
    "        <name>Homestyle Breakfast</name>\n",
    "        <price>$6.95</price>\n",
    "        <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>\n",
    "        <calories>950</calories>\n",
    "    </food>\n",
    "</breakfast_menu>\n",
    "```\n",
    "\n",
    "Other XML files are: [Breakfast Menu](https://www.w3schools.com/xml/simple.xml), [Plant Catalog](https://www.w3schools.com/xml/plant_catalog.xml) and [CD catalog](https://www.w3schools.com/xml/cd_catalog.xml). \n",
    "\n",
    "It is a text file where each line is a **node** of data.  **Each node has a name and a value**. The name is separated from the value by a colon. The value is separated from the name by a colon. The first node is the root node. The root node contains the names of the nodes. The root node is separated from the data nodes by a blank line.\n",
    "\n",
    "In pandas, you can read an XML file using the `pd.read_xml()` function. The first input to `pd.read_xml()` is the filename or filepath that you want to read. Other important parameters of the `pd.read_xml()` function are:\n",
    "\n",
    "* `xpath`: (default: `xpath=None`) specifies the path to the node(s) to be read. Default is `None` which means the entire XML file is read. If you want to read a specific node, then `xpath` should be used.\n",
    "\n",
    "* `namespaces`: (default: `namespaces=None`) specifies the namespaces used in the XML file. Default is `None` which means no namespaces are used. If the XML file uses namespaces, then `namespaces` should be used.\n",
    "\n",
    "* `encoding`: (default: `encoding=None`) specifies the encoding of the XML file. Default is `None` which means the encoding is automatically detected. If the XML file uses a specific encoding, then `encoding` should be used.\n",
    "\n",
    "* `errors`: (default: `errors=None`) specifies how to handle errors. Default is `None` which means the errors are ignored. If the XML file contains errors, then `errors` should be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>price</th>\n",
       "      <th>description</th>\n",
       "      <th>calories</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Belgian Waffles</td>\n",
       "      <td>$5.95</td>\n",
       "      <td>Two of our famous Belgian Waffles with plenty ...</td>\n",
       "      <td>650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Strawberry Belgian Waffles</td>\n",
       "      <td>$7.95</td>\n",
       "      <td>Light Belgian waffles covered with strawberrie...</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Berry-Berry Belgian Waffles</td>\n",
       "      <td>$8.95</td>\n",
       "      <td>Light Belgian waffles covered with an assortme...</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>French Toast</td>\n",
       "      <td>$4.50</td>\n",
       "      <td>Thick slices made from our homemade sourdough ...</td>\n",
       "      <td>600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Homestyle Breakfast</td>\n",
       "      <td>$6.95</td>\n",
       "      <td>Two eggs, bacon or sausage, toast, and our eve...</td>\n",
       "      <td>950</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          name  price  \\\n",
       "0              Belgian Waffles  $5.95   \n",
       "1   Strawberry Belgian Waffles  $7.95   \n",
       "2  Berry-Berry Belgian Waffles  $8.95   \n",
       "3                 French Toast  $4.50   \n",
       "4          Homestyle Breakfast  $6.95   \n",
       "\n",
       "                                         description  calories  \n",
       "0  Two of our famous Belgian Waffles with plenty ...       650  \n",
       "1  Light Belgian waffles covered with strawberrie...       900  \n",
       "2  Light Belgian waffles covered with an assortme...       900  \n",
       "3  Thick slices made from our homemade sourdough ...       600  \n",
       "4  Two eggs, bacon or sausage, toast, and our eve...       950  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_xml('https://www.w3schools.com/xml/simple.xml')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "#### JavaScript Object Notation (JSON)\n",
    "\n",
    "JSON is an open data format used to store hierarchical data. JSON data resembles a Python dictionary. It is a text file where each line is a **key-value pair**. The key is separated from the value by a colon. Similar to a Python dictionary, the value can be a string, a number, a dictionary, a boolean or a list.\n",
    "\n",
    "JSON data can be oriented in two ways: **records** and **columns**. \n",
    "\n",
    "In the **records** orientation, each line is a record. \n",
    "\n",
    "\n",
    "``` json\n",
    "[\n",
    "  {\n",
    "    \"name\": \"Belgian Waffles\",\n",
    "    \"price\": \"$5.95\",\n",
    "    \"description\": \"Two of our famous Belgian Waffles with plenty of real maple syrup\",\n",
    "    \"calories\": 650\n",
    "  },\n",
    "  {\n",
    "    \"name\": \"Strawberry Belgian Waffles\",\n",
    "    \"price\": \"$7.95\",\n",
    "    \"description\": \"Light Belgian waffles covered with strawberries and whipped cream\",\n",
    "    \"calories\": 900\n",
    "  },\n",
    "  {\n",
    "    \"name\": \"Berry-Berry Belgian Waffles\",\n",
    "    \"price\": \"$8.95\",\n",
    "    \"description\": \"Light Belgian waffles covered with an assortment of fresh berries and whipped cream\",\n",
    "    \"calories\": 900\n",
    "  },\n",
    "  {\n",
    "    \"name\": \"French Toast\",\n",
    "    \"price\": \"$4.50\",\n",
    "    \"description\": \"Thick slices made from our homemade sourdough bread\",\n",
    "    \"calories\": 600\n",
    "  },\n",
    "  {\n",
    "    \"name\": \"Homestyle Breakfast\",\n",
    "    \"price\": \"$6.95\",\n",
    "    \"description\": \"Two eggs, bacon or sausage, toast, and our ever-popular hash browns\",\n",
    "    \"calories\": 950\n",
    "  }\n",
    "]\n",
    "```\n",
    "\n",
    "In the **columns** orientation, each line is a column. The same JSON data can be represented as a table as follows:\n",
    "\n",
    "``` json \n",
    "{\n",
    "  \"name\": {\n",
    "    \"0\": \"Belgian Waffles\",\n",
    "    \"1\": \"Strawberry Belgian Waffles\",\n",
    "    \"2\": \"Berry-Berry Belgian Waffles\",\n",
    "    \"3\": \"French Toast\",\n",
    "    \"4\": \"Homestyle Breakfast\"\n",
    "  },\n",
    "  \"price\": {\n",
    "    \"0\": \"$5.95\",\n",
    "    \"1\": \"$7.95\",\n",
    "    \"2\": \"$8.95\",\n",
    "    \"3\": \"$4.50\",\n",
    "    \"4\": \"$6.95\"\n",
    "  },\n",
    "  \"description\": {\n",
    "    \"0\": \"Two of our famous Belgian Waffles with plenty of real maple syrup\",\n",
    "    \"1\": \"Light Belgian waffles covered with strawberries and whipped cream\",\n",
    "    \"2\": \"Light Belgian waffles covered with an assortment of fresh berries and whipped cream\",\n",
    "    \"3\": \"Thick slices made from our homemade sourdough bread\",\n",
    "    \"4\": \"Two eggs, bacon or sausage, toast, and our ever-popular hash browns\"\n",
    "  },\n",
    "  \"calories\": {\n",
    "    \"0\": 650,\n",
    "    \"1\": 900,\n",
    "    \"2\": 900,\n",
    "    \"3\": 600,\n",
    "    \"4\": 950\n",
    "  }\n",
    "}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to `pd.read_csv`, `pd.read_excel` and `pd.read_xml`, you can read a JSON file using the `pd.read_json()` function. \n",
    "\n",
    "The first input is `filepath`. There is no `sep`, `header` or `index_col` parameter because the JSON files don't have flat structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>price</th>\n",
       "      <th>description</th>\n",
       "      <th>calories</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Belgian Waffles</td>\n",
       "      <td>$5.95</td>\n",
       "      <td>Two of our famous Belgian Waffles with plenty ...</td>\n",
       "      <td>650</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Strawberry Belgian Waffles</td>\n",
       "      <td>$7.95</td>\n",
       "      <td>Light Belgian waffles covered with strawberrie...</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Berry-Berry Belgian Waffles</td>\n",
       "      <td>$8.95</td>\n",
       "      <td>Light Belgian waffles covered with an assortme...</td>\n",
       "      <td>900</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>French Toast</td>\n",
       "      <td>$4.50</td>\n",
       "      <td>Thick slices made from our homemade sourdough ...</td>\n",
       "      <td>600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Homestyle Breakfast</td>\n",
       "      <td>$6.95</td>\n",
       "      <td>Two eggs, bacon or sausage, toast, and our eve...</td>\n",
       "      <td>950</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          name  price  \\\n",
       "0              Belgian Waffles  $5.95   \n",
       "1   Strawberry Belgian Waffles  $7.95   \n",
       "2  Berry-Berry Belgian Waffles  $8.95   \n",
       "3                 French Toast  $4.50   \n",
       "4          Homestyle Breakfast  $6.95   \n",
       "\n",
       "                                         description  calories  \n",
       "0  Two of our famous Belgian Waffles with plenty ...       650  \n",
       "1  Light Belgian waffles covered with strawberrie...       900  \n",
       "2  Light Belgian waffles covered with an assortme...       900  \n",
       "3  Thick slices made from our homemade sourdough ...       600  \n",
       "4  Two eggs, bacon or sausage, toast, and our eve...       950  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "data = pd.read_json('../data/sample.json')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, pandas has a `df.to_json()` function to write a DataFrame to a JSON file. The parameter `orient` specifies the orientation of the JSON file. The default is `orient='records'` which means the JSON file is written in the records orientation. If you want to write the JSON file in the columns orientation, then `orient='columns'` should be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"name\":\"Belgian Waffles\",\"price\":\"$5.95\",\"description\":\"Two of our famous Belgian Waffles with plenty of real maple syrup\",\"calories\":650}\n",
      "{\"name\":\"Strawberry Belgian Waffles\",\"price\":\"$7.95\",\"description\":\"Light Belgian waffles covered with strawberries and whipped cream\",\"calories\":900}\n",
      "{\"name\":\"Berry-Berry Belgian Waffles\",\"price\":\"$8.95\",\"description\":\"Light Belgian waffles covered with an assortment of fresh berries and whipped cream\",\"calories\":900}\n",
      "{\"name\":\"French Toast\",\"price\":\"$4.50\",\"description\":\"Thick slices made from our homemade sourdough bread\",\"calories\":600}\n",
      "{\"name\":\"Homestyle Breakfast\",\"price\":\"$6.95\",\"description\":\"Two eggs, bacon or sausage, toast, and our ever-popular hash browns\",\"calories\":950}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "json_data = data.to_json(orient='records', lines=True)\n",
    "print(json_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('{\"name\":{\"0\":\"Belgian Waffles\",\"1\":\"Strawberry Belgian '\n",
      " 'Waffles\",\"2\":\"Berry-Berry Belgian Waffles\",\"3\":\"French Toast\",\"4\":\"Homestyle '\n",
      " 'Breakfast\"},\"price\":{\"0\":\"$5.95\",\"1\":\"$7.95\",\"2\":\"$8.95\",\"3\":\"$4.50\",\"4\":\"$6.95\"},\"description\":{\"0\":\"Two '\n",
      " 'of our famous Belgian Waffles with plenty of real maple syrup\",\"1\":\"Light '\n",
      " 'Belgian waffles covered with strawberries and whipped cream\",\"2\":\"Light '\n",
      " 'Belgian waffles covered with an assortment of fresh berries and whipped '\n",
      " 'cream\",\"3\":\"Thick slices made from our homemade sourdough bread\",\"4\":\"Two '\n",
      " 'eggs, bacon or sausage, toast, and our ever-popular hash '\n",
      " 'browns\"},\"calories\":{\"0\":650,\"1\":900,\"2\":900,\"3\":600,\"4\":950}}')\n"
     ]
    }
   ],
   "source": [
    "import pprint \n",
    "\n",
    "pprint.pprint(data.to_json(orient='columns'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}