{ "cells": [ { "cell_type": "markdown", "id": "e352599a", "metadata": {}, "source": [ "# US Census Bureau Median Housing Data\n", "\n", "The US Census Bureau publicly provides an extensive amount of information that it obtains. Sifting through and finding the right data you're looking for is difficult enough. Getting that data into a usable format that can then be uploaded to an RDBMS is something else entirely. This guide will walk us through how to retrieve the data, how to clean it and modify it within Python, and then how to export it to an Excel or CSV file that can be uploaded to an RDBMS. This is the same process used to create the MEDIAN_HOUSE_VALUE table on the USZIPCODE database for the University.\n", "## ZCTA Vs Zip Code\n", "\n", "For this example we will be using The Census Bureau derived ZCTA and not actual Zip Codes. You can read up on ZCTA more in the link below. We will also create a table that will allow us to convert and group Zip Codes to ZCTA so we can compare data in either format.\n", "
\n", "[Census Bureau ZCTAs](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html)\n", "
\n", "```{admonition} TL:DR\n", "ZCTA combine Zip Codes with very few addresses and assign the most frequent Zip within that grouping as the ZCTA.\n", "```\n", "\n", "## Obtaining The Data\n", "\n", "1.) Navigate to https://data.census.gov/cedsci/table. Here you can search for all of the different information that the US Census Bureau publicy provides. For this example, search for \"B25109\" or \"Median Value by Year Householder Moved Into Unit\" to use the same dataset, or another table of interest that allows you to break out that data by ZCTA5 code in the next step.\n", "
\n", "
\n", "2.) Click the \"Geos\" button in the table header and navigate to \"Zip Code Tabulation Area\" (we will refer to this as ZCTA from here on.) If you want to work with a smaller subset of data first, select a single state and then All ZCTA for that state. Or select the top button for all ZCTA in the US. This will likely take some time since the table is roughly 100,000 rows. Hit the \"Transpose\" button to format the data so that ZCTA, Estimate, and Margin of Error are the rows and the year ranges are the columns.\n", "
\n", "
\n", "```{note} As stated in the first step, not all tables will allow you to break out the data into ZCTA. While some of the steps can be used for data outside of ZCTA level data, this guide is explicitly for cleaning ZCTA data.\n", "```\n", "
\n", "
\n", "3.) The first column is now giving us a median of the medians and should be removed since we can do this calculation easily later if we wanted to. I also removed the final two columns of data before 2000, but these can be kept if you are interested in historical data going that far back. To remove these columns, click \"Columns\" on the far right side of the website and de-select the top two buttons with no title next to them, and the final two buttons for data before 2000.\n", "
\n", "
\n", "4.) Press the CSV button on the top header to export your dataset to a CSV and file and wait for it to download. This will likely take some time since it is almost 100,000 rows.\n", "\n", "## Importing The Data\n", "\n", "We're finally ready to start preparing our data! It is a good idea to move the downloaded file to the same folder you want the Jupyter Notebook or Python program to be saved in. I also renamed the file to easily identify what it was and easily read it into Jupyter.\n", "
\n", "
\n", "Execute the following cell looking at both the head and the tail." ] }, { "cell_type": "code", "execution_count": 2, "id": "735412e1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Label (Grouping)Median value --!!Total:!!Moved in 2019 or laterMedian value --!!Total:!!Moved in 2015 to 2018Median value --!!Total:!!Moved in 2010 to 2014Median value --!!Total:!!Moved in 2000 to 2009
0ZCTA5 00601NaNNaNNaNNaN
1Estimate-82,30081,70095,100
2Margin of Error**±10,863±29,876±20,917
3ZCTA5 00602NaNNaNNaNNaN
4Estimate-144,600116,100112,200
5Margin of Error**±45,088±15,512±20,066
6ZCTA5 00603NaNNaNNaNNaN
7Estimate260,300134,400130,400124,100
8Margin of Error±31,083±57,761±13,058±12,732
9ZCTA5 00606NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " Label (Grouping) Median value --!!Total:!!Moved in 2019 or later \\\n", "0 ZCTA5 00601 NaN \n", "1     Estimate - \n", "2     Margin of Error ** \n", "3 ZCTA5 00602 NaN \n", "4     Estimate - \n", "5     Margin of Error ** \n", "6 ZCTA5 00603 NaN \n", "7     Estimate 260,300 \n", "8     Margin of Error ±31,083 \n", "9 ZCTA5 00606 NaN \n", "\n", " Median value --!!Total:!!Moved in 2015 to 2018 \\\n", "0 NaN \n", "1 82,300 \n", "2 ±10,863 \n", "3 NaN \n", "4 144,600 \n", "5 ±45,088 \n", "6 NaN \n", "7 134,400 \n", "8 ±57,761 \n", "9 NaN \n", "\n", " Median value --!!Total:!!Moved in 2010 to 2014 \\\n", "0 NaN \n", "1 81,700 \n", "2 ±29,876 \n", "3 NaN \n", "4 116,100 \n", "5 ±15,512 \n", "6 NaN \n", "7 130,400 \n", "8 ±13,058 \n", "9 NaN \n", "\n", " Median value --!!Total:!!Moved in 2000 to 2009 \n", "0 NaN \n", "1 95,100 \n", "2 ±20,917 \n", "3 NaN \n", "4 112,200 \n", "5 ±20,066 \n", "6 NaN \n", "7 124,100 \n", "8 ±12,732 \n", "9 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv(\"contrib_data/US_Housing_by_zip.csv\")\n", "\n", "df.head(10)\n", "#df.tail(10)" ] }, { "cell_type": "markdown", "id": "952fd543", "metadata": {}, "source": [ "## Cleaning the Data\n", "To put it simply, this data is ugly. The current format is essentially unqueryable. The data does follow a recognizable pattern though. \n", "
\n", "* Every 3 rows are ZCTA #, Estimate, Margin of Error with the Estimate and Margin of Error below it relatting to the ZCTA above\n", "* The ZCTA row always starts with 'ZCTA5' and has no values. \n", "* Margin of Error always starts with ± when it has a value. \n", "
\n", "
\n", "One other thing to note is the final 3 rows are an aggregate of all United States data. Since there is no single ZCTA associated with this row and we can easily run this aggregate with a query when we are finished, we will want to drop these last 3 rows. \n", "
\n", "
\n", "To get this data into a usable format we will:\n", "1. Rename the columns for easier management and querying\n", " * This isn't necessary, but like working with SQL aliases this makes some things much easier\n", "2. Remove all ± and other special characters\n", " * This will help with point 5 and is an easy cleaning step before processing\n", "3. Drop the last 3 rows for reasons stated above\n", "4. Match each 'Estimate' and 'Margin of Error' with the appropriate ZCTA\n", " * This will get the data for each ZCTA into a single row\n", "5. Ensure each column is the proper data type\n", " * Zip and ACTA can have leading zeros. We need to ensure they import with these zeros for joining\n", "6. Break out the year buckets into individual years\n", " * This will make querying more intuitive and allow for easier appending when new data for years becomes available\n", "7. Drop any rows in the final dataframe that have no values\n", "
\n", "
\n", "These steps will produce a final dataframe that is easy to query, easy to manage, and match the schema of the other tables in the database. We will be taking the above, and transform it to the format below.\n", "\n", "```{list-table}\n", ":header-rows: 1\n", ":name: example-table\n", "\n", "* - ZCTA\n", " - Estimate\n", " - Margin_of_Error\n", " - Year\n", "* - 00601\n", " - 82300\n", " - 10863\n", " - 2015\n", "* - 00601\n", " - 81700\n", " - 29876\n", " - 2014\n", "\n", "```\n", "\n", "Lets get started. Execute the following cell." ] }, { "cell_type": "code", "execution_count": 13, "id": "505caa85", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA52019_Or_Later2015_To_20182010_To_20142000_To_2009
0ZCTA5 00601NaNNaNNaNNaN
1Estimate-82,30081,70095,100
2Margin of Error**10,86329,87620,917
3ZCTA5 00602NaNNaNNaNNaN
4Estimate-144,600116,100112,200
..................
99355Estimate----
99356Margin of Error********
99357ZCTA5 99929NaNNaNNaNNaN
99358Estimate-196,900178,800231,900
99359Margin of Error**52,06130,05193,901
\n", "

99360 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 2019_Or_Later 2015_To_2018 2010_To_2014 \\\n", "0 ZCTA5 00601 NaN NaN NaN \n", "1     Estimate - 82,300 81,700 \n", "2     Margin of Error ** 10,863 29,876 \n", "3 ZCTA5 00602 NaN NaN NaN \n", "4     Estimate - 144,600 116,100 \n", "... ... ... ... ... \n", "99355     Estimate - - - \n", "99356     Margin of Error ** ** ** \n", "99357 ZCTA5 99929 NaN NaN NaN \n", "99358     Estimate - 196,900 178,800 \n", "99359     Margin of Error ** 52,061 30,051 \n", "\n", " 2000_To_2009 \n", "0 NaN \n", "1 95,100 \n", "2 20,917 \n", "3 NaN \n", "4 112,200 \n", "... ... \n", "99355 - \n", "99356 ** \n", "99357 NaN \n", "99358 231,900 \n", "99359 93,901 \n", "\n", "[99360 rows x 5 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Rename the columns \n", "# Remove all of the ±\n", "# Drop the last 3 rows\n", "df2 =df\n", "df2.columns = ['ZCTA5', '2019_Or_Later', '2015_To_2018', '2010_To_2014', '2000_To_2009']\n", "df2 = df2.replace('^\\±', '', regex=True)\n", "df2.drop(df2.tail(3).index,inplace=True)\n", "\n", "df2" ] }, { "cell_type": "markdown", "id": "5282e9f2", "metadata": {}, "source": [ "### Break the data apart\n", "In the next steps we will create new dataframes for all ZCTA, Estimate, and Margin of Error rows. When we break these out into their separate tables they will retain their order. Even though each row will not directly have a ZCTA associated with it, we know that the first row will be ZCTA 00601, the second row ZCTA 00602 and so on based in the index because the rows retained their order. This will then allow us to merge these 3 dataframes back together based on their index giving us a single row for each ZCTA with the Estimate 'EST' and Margin of Error 'MOE' for each year bucket." ] }, { "cell_type": "code", "execution_count": 14, "id": "090674f0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 00601\n", "1 00602\n", "2 00603\n", "3 00606\n", "4 00610\n", " ... 
\n", "33115 99923\n", "33116 99925\n", "33117 99926\n", "33118 99927\n", "33119 99929\n", "Name: ZCTA5, Length: 33120, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter for only rows that start with 'ZCTA' and create a new dataframe. Retain the order and reindex the rows.\n", "# Drop all columns except the first column because they contain no data\n", "# Remove the 'ZCTA5 ' from each row so we are left with just the ZCTA numbers\n", "ZCTA = df2[df2['ZCTA5'].str[:4]=='ZCTA'].copy().reset_index(drop=True)\n", "ZCTA.drop(ZCTA.columns[[1,2,3,4]], axis=1, inplace=True)\n", "ZCTA = ZCTA['ZCTA5'].str.split(' ').str[1]\n", "\n", "ZCTA" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b123ba8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST
0-82,30081,70095,100
1-144,600116,100112,200
2260,300134,400130,400124,100
3-58,20097,00092,900
4-115,900110,500112,000
...............
33115----
33116-263,900339,200181,300
33117-170,300243,800154,400
33118----
33119-196,900178,800231,900
\n", "

33120 rows × 4 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST 2000_To_2009_EST\n", "0 - 82,300 81,700 95,100\n", "1 - 144,600 116,100 112,200\n", "2 260,300 134,400 130,400 124,100\n", "3 - 58,200 97,000 92,900\n", "4 - 115,900 110,500 112,000\n", "... ... ... ... ...\n", "33115 - - - -\n", "33116 - 263,900 339,200 181,300\n", "33117 - 170,300 243,800 154,400\n", "33118 - - - -\n", "33119 - 196,900 178,800 231,900\n", "\n", "[33120 rows x 4 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter the original dataframe for only rows that start with 'Estimate' and create a new dataframe. Retain the order and reindex the rows\n", "# Rename the columns identifying them as EST columns for each year bucket\n", "# Drop the first column because it would just say Estimate 33,000 times\n", "EST = df2[df2['ZCTA5'].str.contains('Estimate', regex=False)].copy().reset_index(drop=True)\n", "EST.columns = ['ZCTA5', '2019_Or_Later_EST', '2015_To_2018_EST', '2010_To_2014_EST', '2000_To_2009_EST']\n", "EST.drop(EST.columns[[0]], axis=1, inplace=True)\n", "\n", "EST" ] }, { "cell_type": "code", "execution_count": 16, "id": "50b85ded", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
0**10,86329,87620,917
1**45,08815,51220,066
231,08357,76113,05812,732
3**25,88137,19514,438
4**26,64715,10314,945
...............
33115********
33116**31,17533,70619,820
33117**20,982162,40531,403
33118********
33119**52,06130,05193,901
\n", "

33120 rows × 4 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_MOE 2015_To_2018_MOE 2010_To_2014_MOE 2000_To_2009_MOE\n", "0 ** 10,863 29,876 20,917\n", "1 ** 45,088 15,512 20,066\n", "2 31,083 57,761 13,058 12,732\n", "3 ** 25,881 37,195 14,438\n", "4 ** 26,647 15,103 14,945\n", "... ... ... ... ...\n", "33115 ** ** ** **\n", "33116 ** 31,175 33,706 19,820\n", "33117 ** 20,982 162,405 31,403\n", "33118 ** ** ** **\n", "33119 ** 52,061 30,051 93,901\n", "\n", "[33120 rows x 4 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same process as above but for MOE\n", "MOE = df2[df2['ZCTA5'].str.contains('Margin', regex=False)].copy().reset_index(drop=True)\n", "MOE.columns = ['ZCTA5', '2019_Or_Later_MOE', '2015_To_2018_MOE', '2010_To_2014_MOE', '2000_To_2009_MOE']\n", "MOE.drop(MOE.columns[[0]], axis=1, inplace=True)\n", "\n", "MOE" ] }, { "cell_type": "markdown", "id": "1e0b896e", "metadata": {}, "source": [ "### Rejoin the data\n", "Here we will first rejoin the EST and MOE tables to clean them before joining back to the ZCTAs. Since we want to keep ZCTA in 'object' or text format, not having it with the other data will allow us to clean and put the rest of the data that we do want in a numeric format easily." ] }, { "cell_type": "code", "execution_count": 40, "id": "8c0ae3d8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
0NaN82300.081700.095100.0NaN10863.029876.020917.0
1NaN144600.0116100.0112200.0NaN45088.015512.020066.0
2260300.0134400.0130400.0124100.031083.057761.013058.012732.0
3NaN58200.097000.092900.0NaN25881.037195.014438.0
4NaN115900.0110500.0112000.0NaN26647.015103.014945.0
...........................
33115NaNNaNNaNNaNNaNNaNNaNNaN
33116NaN263900.0339200.0181300.0NaN31175.033706.019820.0
33117NaN170300.0243800.0154400.0NaN20982.0162405.031403.0
33118NaNNaNNaNNaNNaNNaNNaNNaN
33119NaN196900.0178800.0231900.0NaN52061.030051.093901.0
\n", "

33120 rows × 8 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST \\\n", "0 NaN 82300.0 81700.0 \n", "1 NaN 144600.0 116100.0 \n", "2 260300.0 134400.0 130400.0 \n", "3 NaN 58200.0 97000.0 \n", "4 NaN 115900.0 110500.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 NaN 263900.0 339200.0 \n", "33117 NaN 170300.0 243800.0 \n", "33118 NaN NaN NaN \n", "33119 NaN 196900.0 178800.0 \n", "\n", " 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE \\\n", "0 95100.0 NaN 10863.0 \n", "1 112200.0 NaN 45088.0 \n", "2 124100.0 31083.0 57761.0 \n", "3 92900.0 NaN 25881.0 \n", "4 112000.0 NaN 26647.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 181300.0 NaN 31175.0 \n", "33117 154400.0 NaN 20982.0 \n", "33118 NaN NaN NaN \n", "33119 231900.0 NaN 52061.0 \n", "\n", " 2010_To_2014_MOE 2000_To_2009_MOE \n", "0 29876.0 20917.0 \n", "1 15512.0 20066.0 \n", "2 13058.0 12732.0 \n", "3 37195.0 14438.0 \n", "4 15103.0 14945.0 \n", "... ... ... \n", "33115 NaN NaN \n", "33116 33706.0 19820.0 \n", "33117 162405.0 31403.0 \n", "33118 NaN NaN \n", "33119 30051.0 93901.0 \n", "\n", "[33120 rows x 8 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the EST and MOE tables back together on the index\n", "# Remove commas because it seems to confuse pandas and doesn't allow conversion to numeric\n", "# Remove other special characters and replace with nan\n", "# Convert all data in the dataframe to numeric\n", "Merge = pd.concat([EST, MOE], axis=1)\n", "Merge = Merge.replace(',','', regex=True)\n", "Merge = Merge.replace({'-': np.nan, '**': np.nan, '***': np.nan})\n", "Merge = Merge.apply(pd.to_numeric, errors='coerce')\n", "\n", "Merge\n", "#Merge.dtypes" ] }, { "cell_type": "markdown", "id": "064b22c4", "metadata": {}, "source": [ "```{note} Note that dtypes are float and we want int in the database. 
This is fine since the target table in the database will be int and that is how the data will import\n", "```" ] }, { "cell_type": "code", "execution_count": 44, "id": "3718e327", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA52019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
000601NaN82300.081700.095100.0NaN10863.029876.020917.0
100602NaN144600.0116100.0112200.0NaN45088.015512.020066.0
200603260300.0134400.0130400.0124100.031083.057761.013058.012732.0
300606NaN58200.097000.092900.0NaN25881.037195.014438.0
400610NaN115900.0110500.0112000.0NaN26647.015103.014945.0
..............................
3311599923NaNNaNNaNNaNNaNNaNNaNNaN
3311699925NaN263900.0339200.0181300.0NaN31175.033706.019820.0
3311799926NaN170300.0243800.0154400.0NaN20982.0162405.031403.0
3311899927NaNNaNNaNNaNNaNNaNNaNNaN
3311999929NaN196900.0178800.0231900.0NaN52061.030051.093901.0
\n", "

33120 rows × 9 columns

\n", "
" ], "text/plain": [ " ZCTA5 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST \\\n", "0 00601 NaN 82300.0 81700.0 \n", "1 00602 NaN 144600.0 116100.0 \n", "2 00603 260300.0 134400.0 130400.0 \n", "3 00606 NaN 58200.0 97000.0 \n", "4 00610 NaN 115900.0 110500.0 \n", "... ... ... ... ... \n", "33115 99923 NaN NaN NaN \n", "33116 99925 NaN 263900.0 339200.0 \n", "33117 99926 NaN 170300.0 243800.0 \n", "33118 99927 NaN NaN NaN \n", "33119 99929 NaN 196900.0 178800.0 \n", "\n", " 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE \\\n", "0 95100.0 NaN 10863.0 \n", "1 112200.0 NaN 45088.0 \n", "2 124100.0 31083.0 57761.0 \n", "3 92900.0 NaN 25881.0 \n", "4 112000.0 NaN 26647.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 181300.0 NaN 31175.0 \n", "33117 154400.0 NaN 20982.0 \n", "33118 NaN NaN NaN \n", "33119 231900.0 NaN 52061.0 \n", "\n", " 2010_To_2014_MOE 2000_To_2009_MOE \n", "0 29876.0 20917.0 \n", "1 15512.0 20066.0 \n", "2 13058.0 12732.0 \n", "3 37195.0 14438.0 \n", "4 15103.0 14945.0 \n", "... ... ... \n", "33115 NaN NaN \n", "33116 33706.0 19820.0 \n", "33117 162405.0 31403.0 \n", "33118 NaN NaN \n", "33119 30051.0 93901.0 \n", "\n", "[33120 rows x 9 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the data with its resepctive ZCTA\n", "Merge2 = pd.concat([ZCTA, Merge], axis=1)\n", "\n", "Merge2\n", "#Merge2.dtypes" ] }, { "cell_type": "markdown", "id": "39e446b5", "metadata": {}, "source": [ "```{note} Note that ZCTA5 is dtype object and retained the leading zeros like we wanted\n", "```" ] }, { "cell_type": "markdown", "id": "b4b63136", "metadata": {}, "source": [ "### Break The Buckets Into Start and End Years\n", "Since querying for an individual year is more intuitive than querying a column bucket year range, we will now break out each each bucket to its respective years. First, we will create tables with new columns for beginning and ending years of the buckets for year range. 
Then we will do the equivalent of a SQL UNION to bring them all back together." ] }, { "cell_type": "code", "execution_count": 45, "id": "386d9d13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
000601NaNNaN20192020
100602NaNNaN20192020
200603260300.031083.020192020
300606NaNNaN20192020
400610NaNNaN20192020
..................
3311599923NaNNaN20192020
3311699925NaNNaN20192020
3311799926NaNNaN20192020
3311899927NaNNaN20192020
3311999929NaNNaN20192020
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 NaN NaN 2019 2020\n", "1 00602 NaN NaN 2019 2020\n", "2 00603 260300.0 31083.0 2019 2020\n", "3 00606 NaN NaN 2019 2020\n", "4 00610 NaN NaN 2019 2020\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2019 2020\n", "33116 99925 NaN NaN 2019 2020\n", "33117 99926 NaN NaN 2019 2020\n", "33118 99927 NaN NaN 2019 2020\n", "33119 99929 NaN NaN 2019 2020\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new table for only ZCTA5 and the 2019_Or_Later buckets\n", "# Create a new column for the first year in the bucket\n", "# Create a new column for the last year in the bucket (in this case we do not have data after 2019 so end in 2020)\n", "# Rename the columns\n", "Trans1 = Merge2.filter(['ZCTA5', '2019_Or_Later_EST', '2019_Or_Later_MOE' ], axis=1)\n", "Trans1['Start_Year'] = 2019\n", "Trans1['End_Year'] = 2020\n", "Trans1.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans1" ] }, { "cell_type": "code", "execution_count": 46, "id": "00f89984", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060182300.010863.020152018
100602144600.045088.020152018
200603134400.057761.020152018
30060658200.025881.020152018
400610115900.026647.020152018
..................
3311599923NaNNaN20152018
3311699925263900.031175.020152018
3311799926170300.020982.020152018
3311899927NaNNaN20152018
3311999929196900.052061.020152018
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 82300.0 10863.0 2015 2018\n", "1 00602 144600.0 45088.0 2015 2018\n", "2 00603 134400.0 57761.0 2015 2018\n", "3 00606 58200.0 25881.0 2015 2018\n", "4 00610 115900.0 26647.0 2015 2018\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2015 2018\n", "33116 99925 263900.0 31175.0 2015 2018\n", "33117 99926 170300.0 20982.0 2015 2018\n", "33118 99927 NaN NaN 2015 2018\n", "33119 99929 196900.0 52061.0 2015 2018\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except 2015 start and 2018 end\n", "# Rename the columns same as above to allow matching for union later\n", "Trans2 = Merge2.filter(['ZCTA5', '2015_To_2018_EST', '2015_To_2018_MOE'])\n", "Trans2['Start_Year'] = 2015\n", "Trans2['End_Year'] = 2018\n", "Trans2.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans2" ] }, { "cell_type": "code", "execution_count": 47, "id": "b65cdcaa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060181700.029876.020102014
100602116100.015512.020102014
200603130400.013058.020102014
30060697000.037195.020102014
400610110500.015103.020102014
..................
3311599923NaNNaN20102014
3311699925339200.033706.020102014
3311799926243800.0162405.020102014
3311899927NaNNaN20102014
3311999929178800.030051.020102014
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 81700.0 29876.0 2010 2014\n", "1 00602 116100.0 15512.0 2010 2014\n", "2 00603 130400.0 13058.0 2010 2014\n", "3 00606 97000.0 37195.0 2010 2014\n", "4 00610 110500.0 15103.0 2010 2014\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2010 2014\n", "33116 99925 339200.0 33706.0 2010 2014\n", "33117 99926 243800.0 162405.0 2010 2014\n", "33118 99927 NaN NaN 2010 2014\n", "33119 99929 178800.0 30051.0 2010 2014\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except for 2010 start and 2014 end\n", "Trans3 = Merge2.filter(['ZCTA5', '2010_To_2014_EST', '2010_To_2014_MOE'])\n", "Trans3['Start_Year'] = 2010\n", "Trans3['End_Year'] = 2014\n", "Trans3.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans3" ] }, { "cell_type": "code", "execution_count": 48, "id": "bd0d82c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060195100.020917.020002009
100602112200.020066.020002009
200603124100.012732.020002009
30060692900.014438.020002009
400610112000.014945.020002009
..................
3311599923NaNNaN20002009
3311699925181300.019820.020002009
3311799926154400.031403.020002009
3311899927NaNNaN20002009
3311999929231900.093901.020002009
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 95100.0 20917.0 2000 2009\n", "1 00602 112200.0 20066.0 2000 2009\n", "2 00603 124100.0 12732.0 2000 2009\n", "3 00606 92900.0 14438.0 2000 2009\n", "4 00610 112000.0 14945.0 2000 2009\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2000 2009\n", "33116 99925 181300.0 19820.0 2000 2009\n", "33117 99926 154400.0 31403.0 2000 2009\n", "33118 99927 NaN NaN 2000 2009\n", "33119 99929 231900.0 93901.0 2000 2009\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except for 2000 start and 2009 end\n", "Trans4 = Merge2.filter(['ZCTA5', '2000_To_2009_EST', '2000_To_2009_MOE'])\n", "Trans4['Start_Year'] = 2000\n", "Trans4['End_Year'] = 2009\n", "Trans4.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans4" ] }, { "cell_type": "code", "execution_count": 50, "id": "30490dda", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060195100.020917.020002009
00060181700.029876.020102014
00060182300.010863.020152018
000601NaNNaN20192020
100602112200.020066.020002009
..................
3311899927NaNNaN20192020
3311999929231900.093901.020002009
3311999929178800.030051.020102014
3311999929196900.052061.020152018
3311999929NaNNaN20192020
\n", "

132480 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 95100.0 20917.0 2000 2009\n", "0 00601 81700.0 29876.0 2010 2014\n", "0 00601 82300.0 10863.0 2015 2018\n", "0 00601 NaN NaN 2019 2020\n", "1 00602 112200.0 20066.0 2000 2009\n", "... ... ... ... ... ...\n", "33118 99927 NaN NaN 2019 2020\n", "33119 99929 231900.0 93901.0 2000 2009\n", "33119 99929 178800.0 30051.0 2010 2014\n", "33119 99929 196900.0 52061.0 2015 2018\n", "33119 99929 NaN NaN 2019 2020\n", "\n", "[132480 rows x 5 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the 4 new dataframes as list 'frames' \n", "# Bring all data back together by concatonating our frames list together. Sort the data by ZCTA and Year\n", "frames = [Trans1, Trans2, Trans3, Trans4]\n", "frames2 = pd.concat(frames).sort_values(['ZCTA5', 'Start_Year'])\n", "\n", "frames2" ] }, { "cell_type": "markdown", "id": "cb878af7", "metadata": {}, "source": [ "### Break The Buckets To Individual Years\n", "Now we could do a SQL query for Year Between Star_Year and End_Year, but we want to be able to query for a single row and single year to be consistent with some of the other tables in the database. To do this, we will use the lambda function to create a column with a list of all the years in each bucket. Then we use explode to create a row for each Year in that new Year list." ] }, { "cell_type": "code", "execution_count": 52, "id": "347b2179", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_YearYear
00060195100.020917.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
00060181700.029876.020102014[2010, 2011, 2012, 2013, 2014]
00060182300.010863.020152018[2015, 2016, 2017, 2018]
000601NaNNaN20192020[2019, 2020]
100602112200.020066.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
.....................
3311899927NaNNaN20192020[2019, 2020]
3311999929231900.093901.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
3311999929178800.030051.020102014[2010, 2011, 2012, 2013, 2014]
3311999929196900.052061.020152018[2015, 2016, 2017, 2018]
3311999929NaNNaN20192020[2019, 2020]
\n", "

132480 rows × 6 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year \\\n", "0 00601 95100.0 20917.0 2000 2009 \n", "0 00601 81700.0 29876.0 2010 2014 \n", "0 00601 82300.0 10863.0 2015 2018 \n", "0 00601 NaN NaN 2019 2020 \n", "1 00602 112200.0 20066.0 2000 2009 \n", "... ... ... ... ... ... \n", "33118 99927 NaN NaN 2019 2020 \n", "33119 99929 231900.0 93901.0 2000 2009 \n", "33119 99929 178800.0 30051.0 2010 2014 \n", "33119 99929 196900.0 52061.0 2015 2018 \n", "33119 99929 NaN NaN 2019 2020 \n", "\n", " Year \n", "0 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "0 [2010, 2011, 2012, 2013, 2014] \n", "0 [2015, 2016, 2017, 2018] \n", "0 [2019, 2020] \n", "1 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "... ... \n", "33118 [2019, 2020] \n", "33119 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "33119 [2010, 2011, 2012, 2013, 2014] \n", "33119 [2015, 2016, 2017, 2018] \n", "33119 [2019, 2020] \n", "\n", "[132480 rows x 6 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create list of all years in the bucket\n", "frames2['Year'] = frames2.apply(lambda x: list(range(x['Start_Year'], x['End_Year']+1)), axis =1)\n", "\n", "frames2" ] }, { "cell_type": "code", "execution_count": 55, "id": "b914585d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORYear
00060195100.020917.02000
00060195100.020917.02001
00060195100.020917.02002
00060195100.020917.02003
00060195100.020917.02004
...............
3311999929178800.030051.02014
3311999929196900.052061.02015
3311999929196900.052061.02016
3311999929196900.052061.02017
3311999929196900.052061.02018
\n", "

501841 rows × 4 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Year\n", "0 00601 95100.0 20917.0 2000\n", "0 00601 95100.0 20917.0 2001\n", "0 00601 95100.0 20917.0 2002\n", "0 00601 95100.0 20917.0 2003\n", "0 00601 95100.0 20917.0 2004\n", "... ... ... ... ...\n", "33119 99929 178800.0 30051.0 2014\n", "33119 99929 196900.0 52061.0 2015\n", "33119 99929 196900.0 52061.0 2016\n", "33119 99929 196900.0 52061.0 2017\n", "33119 99929 196900.0 52061.0 2018\n", "\n", "[501841 rows x 4 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Explode the Year list to new columns and drop the now unnecessary start and end year columns\n", "# Drop any rows where there is no estimate. If there is no data, there is no reason for it to be in the databse.\n", "final = frames2.explode('Year').drop(columns=['Start_Year','End_Year'])\n", "final = final.dropna(subset = ['ESTIMATE'])\n", "\n", "final" ] }, { "cell_type": "markdown", "id": "58e0e9bd", "metadata": {}, "source": [ "### Next Steps - Zip to ZCTA\n", "Now that we have our data in a format that matches some of the other tables in the database, we need to be able to match this ZCTA level table to Zip Code level tables. To do this, we will use the Zip to ZCTA Crosswalk provided from the link below. This will allow us to write a Group By statement in a SQL query that will convert Zip level data to ZCTA level data that can then be joined to our ZCTA level data. This table also has the added benefit of including the City and State name for each Zip.\n", "\n", "https://udsmapper.org/zip-code-to-zcta-crosswalk/" ] }, { "cell_type": "code", "execution_count": 3, "id": "c2715ccc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZIP_CODEPO_NAMESTATEZIP_TYPEZCTAzip_join_type
000501HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
100544HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
200601AdjuntasPRZip Code Area00601Zip matches ZCTA
300602AguadaPRZip Code Area00602Zip matches ZCTA
400603AguadillaPRZip Code Area00603Zip matches ZCTA
.....................
4108699926MetlakatlaAKZip Code Area99926Zip matches ZCTA
4108799927Point BakerAKZip Code Area99927Zip matches ZCTA
4108899928Ward CoveAKPost Office or large volume customer99901Spatial join to ZCTA
4108999929WrangellAKZip Code Area99929Zip matches ZCTA
4109099950KetchikanAKZip Code Area99921Spatial join to ZCTA
\n", "

41091 rows × 6 columns

\n", "
" ], "text/plain": [ " ZIP_CODE PO_NAME STATE ZIP_TYPE \\\n", "0 00501 Holtsville NY Post Office or large volume customer \n", "1 00544 Holtsville NY Post Office or large volume customer \n", "2 00601 Adjuntas PR Zip Code Area \n", "3 00602 Aguada PR Zip Code Area \n", "4 00603 Aguadilla PR Zip Code Area \n", "... ... ... ... ... \n", "41086 99926 Metlakatla AK Zip Code Area \n", "41087 99927 Point Baker AK Zip Code Area \n", "41088 99928 Ward Cove AK Post Office or large volume customer \n", "41089 99929 Wrangell AK Zip Code Area \n", "41090 99950 Ketchikan AK Zip Code Area \n", "\n", " ZCTA zip_join_type \n", "0 11742 Spatial join to ZCTA \n", "1 11742 Spatial join to ZCTA \n", "2 00601 Zip matches ZCTA \n", "3 00602 Zip matches ZCTA \n", "4 00603 Zip matches ZCTA \n", "... ... ... \n", "41086 99926 Zip matches ZCTA \n", "41087 99927 Zip matches ZCTA \n", "41088 99901 Spatial join to ZCTA \n", "41089 99929 Zip matches ZCTA \n", "41090 99921 Spatial join to ZCTA \n", "\n", "[41091 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Specify each column as str on import or it will drop the leading zeros that we need.\n", "df = pd.read_excel(\"contrib_data/ZIPCodetoZCTACrosswalk2021UDS.xlsx\", dtype = {'ZIP_CODE': str, 'PO_NAME': str, 'STATE': str, 'ZIP_TYPE': str, 'ZCTA': str, 'zip_join_type': str})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "2973105e", "metadata": {}, "source": [ "### Ready to Import to Database\n", "Both of the tables are now ready to import into the database. Refer to the P4Analytics page on different methods to connect and import data.\n", "### Get Querying\n", "The following query is an example of how we might use some of the data we have. Lets say we wanted to look at the average income and the median home sale price in all Little Rock AR Zips for 2018. 
Our income data is at the Zip Code level, but the median housing data is at the ZCTA level. Because the two tables are at different levels, we will first need to group the Zip-Code-level income data into ZCTAs. We can then use this income-by-ZCTA table as a subquery and join it to the ZCTA-level median housing data. Because a ZCTA takes the number of its most frequent Zip, we can also join back to the Zip to ZCTA conversion table to get the City and State of that Zip/ZCTA and filter the results." ] }, { "cell_type": "code", "execution_count": null, "id": "4d2c860d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pypyodbc as pyodbc\n", "# Change the values below to your personal and server credentials\n", "\n", "svr = 'the_server'\n", "uid = 'your_user_id'\n", "pwd = 'your_password'\n", "db = 'the_database'\n", "\n", "cnxn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER='+svr+';DATABASE='+db+';UID='+uid+';PWD='+pwd+';TrustServerCertificate=yes;')\n", "\n", "query = \"select d.PO_NAME as City \\\n", ", d.STATE \\\n", ", INCOME.* \\\n", ", c.ESTIMATE as MEDIAN_HOUSING_ESTIMATE \\\n", ", c.ESTIMATE - c.MARGIN_OF_ERROR as MEDIAN_HOUSE_LB \\\n", ", c.ESTIMATE + c.MARGIN_OF_ERROR as MEDIAN_HOUSE_UB \\\n", "from (select b.ZCTA \\\n", ", sum(a.Number_Of_Returns) as NBR_RTRNS \\\n", ", floor(avg(a.Total_Income_Amount)) as AVG_INCOME \\\n", ", a.Date_Year \\\n", "from IRS_ZIPCODE_SOURCE a \\\n", "join ZIP_TO_ZCTA b \\\n", "on a.Zipcode = b.ZIP_CODE \\\n", "group by b.ZCTA, a.Date_Year) as INCOME \\\n", "join MEDIAN_HOUSE_VALUE c \\\n", "on INCOME.ZCTA = c.ZCTA5 \\\n", "and INCOME.Date_Year = c.Year \\\n", "join ZIP_TO_ZCTA d \\\n", "on c.ZCTA5 = d.ZIP_CODE \\\n", "where d.STATE = 'AR' \\\n", "and d.PO_NAME = 'Little Rock' \\\n", "and INCOME.Date_Year = 2018\"\n", "\n", "df = pd.read_sql(query, cnxn)\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" } } }, "nbformat": 4, "nbformat_minor": 5 }