{ "cells": [ { "cell_type": "markdown", "id": "e352599a", "metadata": {}, "source": [ "# US Census Bureau Median Housing Data\n", "\n", "The US Census Bureau publicly provides an extensive amount of information that it obtains. Sifting through and finding the right data you're looking for is difficult enough. Getting that data into a usable format that can then be uploaded to an RDBMS is something else entirely. This guide will walk us through how to retrieve the data, how to clean it and modify it within Python, and then how to export it to an Excel or CSV file that can be uploaded to an RDBMS. This is the same process used to create the MEDIAN_HOUSE_VALUE table on the USZIPCODE database for the University.\n", "## ZCTA Vs Zip Code\n", "\n", "For this example we will be using The Census Bureau derived ZCTA and not actual Zip Codes. You can read up on ZCTA more in the link below. We will also create a table that will allow us to convert and group Zip Codes to ZCTA so we can compare data in either format.\n", "
\n", "[Census Bureau ZCTAs](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html)\n", "
\n", "```{admonition} TL:DR\n", "ZCTA combine Zip Codes with very few addresses and assign the most frequent Zip within that grouping as the ZCTA.\n", "```\n", "\n", "## Obtaining The Data\n", "\n", "1.) Navigate to https://data.census.gov/cedsci/table. Here you can search for all of the different information that the US Census Bureau publicy provides. For this example, search for \"B25109\" or \"Median Value by Year Householder Moved Into Unit\" to use the same dataset, or another table of interest that allows you to break out that data by ZCTA5 code in the next step.\n", "
\n", "
\n", "2.) Click the \"Geos\" button in the table header and navigate to \"Zip Code Tabulation Area\" (we will refer to this as ZCTA from here on.) If you want to work with a smaller subset of data first, select a single state and then All ZCTA for that state. Or select the top button for all ZCTA in the US. This will likely take some time since the table is roughly 100,000 rows. Hit the \"Transpose\" button to format the data so that ZCTA, Estimate, and Margin of Error are the rows and the year ranges are the columns.\n", "
\n", "
\n", "```{note} As stated in the first step, not all tables will allow you to break out the data into ZCTA. While some of the steps can be used for data outside of ZCTA level data, this guide is explicitly for cleaning ZCTA data.\n", "```\n", "
\n", "
\n", "3.) The first column is now giving us a median of the medians and should be removed since we can do this calculation easily later if we wanted to. I also removed the final two columns of data before 2000, but these can be kept if you are interested in historical data going that far back. To remove these columns, click \"Columns\" on the far right side of the website and de-select the top two buttons with no title next to them, and the final two buttons for data before 2000.\n", "
\n", "
\n", "4.) Press the CSV button on the top header to export your dataset to a CSV and file and wait for it to download. This will likely take some time since it is almost 100,000 rows.\n", "\n", "## Importing The Data\n", "\n", "We're finally ready to start preparing our data! It is a good idea to move the downloaded file to the same folder you want the Jupyter Notebook or Python program to be saved in. I also renamed the file to easily identify what it was and easily read it into Jupyter.\n", "
\n", "
\n", "Execute the following cell looking at both the head and the tail." ] }, { "cell_type": "code", "execution_count": 2, "id": "735412e1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Label (Grouping)Median value --!!Total:!!Moved in 2019 or laterMedian value --!!Total:!!Moved in 2015 to 2018Median value --!!Total:!!Moved in 2010 to 2014Median value --!!Total:!!Moved in 2000 to 2009
0ZCTA5 00601NaNNaNNaNNaN
1Estimate-82,30081,70095,100
2Margin of Error**±10,863±29,876±20,917
3ZCTA5 00602NaNNaNNaNNaN
4Estimate-144,600116,100112,200
5Margin of Error**±45,088±15,512±20,066
6ZCTA5 00603NaNNaNNaNNaN
7Estimate260,300134,400130,400124,100
8Margin of Error±31,083±57,761±13,058±12,732
9ZCTA5 00606NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " Label (Grouping) Median value --!!Total:!!Moved in 2019 or later \\\n", "0 ZCTA5 00601 NaN \n", "1     Estimate - \n", "2     Margin of Error ** \n", "3 ZCTA5 00602 NaN \n", "4     Estimate - \n", "5     Margin of Error ** \n", "6 ZCTA5 00603 NaN \n", "7     Estimate 260,300 \n", "8     Margin of Error ±31,083 \n", "9 ZCTA5 00606 NaN \n", "\n", " Median value --!!Total:!!Moved in 2015 to 2018 \\\n", "0 NaN \n", "1 82,300 \n", "2 ±10,863 \n", "3 NaN \n", "4 144,600 \n", "5 ±45,088 \n", "6 NaN \n", "7 134,400 \n", "8 ±57,761 \n", "9 NaN \n", "\n", " Median value --!!Total:!!Moved in 2010 to 2014 \\\n", "0 NaN \n", "1 81,700 \n", "2 ±29,876 \n", "3 NaN \n", "4 116,100 \n", "5 ±15,512 \n", "6 NaN \n", "7 130,400 \n", "8 ±13,058 \n", "9 NaN \n", "\n", " Median value --!!Total:!!Moved in 2000 to 2009 \n", "0 NaN \n", "1 95,100 \n", "2 ±20,917 \n", "3 NaN \n", "4 112,200 \n", "5 ±20,066 \n", "6 NaN \n", "7 124,100 \n", "8 ±12,732 \n", "9 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv(\"contrib_data/US_Housing_by_zip.csv\")\n", "\n", "df.head(10)\n", "#df.tail(10)" ] }, { "cell_type": "markdown", "id": "952fd543", "metadata": {}, "source": [ "## Cleaning the Data\n", "To put it simply, this data is ugly. The current format is essentially unqueryable. The data does follow a recognizable pattern though. \n", "
\n", "* Every 3 rows are ZCTA #, Estimate, Margin of Error with the Estimate and Margin of Error below it relatting to the ZCTA above\n", "* The ZCTA row always starts with 'ZCTA5' and has no values. \n", "* Margin of Error always starts with ± when it has a value. \n", "
\n", "
\n", "One other thing to note is the final 3 rows are an aggregate of all United States data. Since there is no single ZCTA associated with this row and we can easily run this aggregate with a query when we are finished, we will want to drop these last 3 rows. \n", "
\n", "
\n", "To get this data into a usable format we will:\n", "1. Rename the columns for easier management and querying\n", " * This isn't necessary, but like working with SQL aliases this makes some things much easier\n", "2. Remove all ± and other special characters\n", " * This will help with point 5 and is an easy cleaning step before processing\n", "3. Drop the last 3 rows for reasons stated above\n", "4. Match each 'Estimate' and 'Margin of Error' with the appropriate ZCTA\n", " * This will get the data for each ZCTA into a single row\n", "5. Ensure each column is the proper data type\n", " * Zip and ACTA can have leading zeros. We need to ensure they import with these zeros for joining\n", "6. Break out the year buckets into individual years\n", " * This will make querying more intuitive and allow for easier appending when new data for years becomes available\n", "7. Drop any rows in the final dataframe that have no values\n", "
\n", "
\n", "These steps will produce a final dataframe that is easy to query, easy to manage, and match the schema of the other tables in the database. We will be taking the above, and transform it to the format below.\n", "\n", "```{list-table}\n", ":header-rows: 1\n", ":name: example-table\n", "\n", "* - ZCTA\n", " - Estimate\n", " - Margin_of_Error\n", " - Year\n", "* - 00601\n", " - 82300\n", " - 10863\n", " - 2015\n", "* - 00601\n", " - 81700\n", " - 29876\n", " - 2014\n", "\n", "```\n", "\n", "Lets get started. Execute the following cell." ] }, { "cell_type": "code", "execution_count": 13, "id": "505caa85", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA52019_Or_Later2015_To_20182010_To_20142000_To_2009
0ZCTA5 00601NaNNaNNaNNaN
1Estimate-82,30081,70095,100
2Margin of Error**10,86329,87620,917
3ZCTA5 00602NaNNaNNaNNaN
4Estimate-144,600116,100112,200
..................
99355Estimate----
99356Margin of Error********
99357ZCTA5 99929NaNNaNNaNNaN
99358Estimate-196,900178,800231,900
99359Margin of Error**52,06130,05193,901
\n", "

99360 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 2019_Or_Later 2015_To_2018 2010_To_2014 \\\n", "0 ZCTA5 00601 NaN NaN NaN \n", "1     Estimate - 82,300 81,700 \n", "2     Margin of Error ** 10,863 29,876 \n", "3 ZCTA5 00602 NaN NaN NaN \n", "4     Estimate - 144,600 116,100 \n", "... ... ... ... ... \n", "99355     Estimate - - - \n", "99356     Margin of Error ** ** ** \n", "99357 ZCTA5 99929 NaN NaN NaN \n", "99358     Estimate - 196,900 178,800 \n", "99359     Margin of Error ** 52,061 30,051 \n", "\n", " 2000_To_2009 \n", "0 NaN \n", "1 95,100 \n", "2 20,917 \n", "3 NaN \n", "4 112,200 \n", "... ... \n", "99355 - \n", "99356 ** \n", "99357 NaN \n", "99358 231,900 \n", "99359 93,901 \n", "\n", "[99360 rows x 5 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Rename the columns \n", "# Remove all of the ±\n", "# Drop the last 3 rows\n", "df2 =df\n", "df2.columns = ['ZCTA5', '2019_Or_Later', '2015_To_2018', '2010_To_2014', '2000_To_2009']\n", "df2 = df2.replace('^\\±', '', regex=True)\n", "df2.drop(df2.tail(3).index,inplace=True)\n", "\n", "df2" ] }, { "cell_type": "markdown", "id": "5282e9f2", "metadata": {}, "source": [ "### Break the data apart\n", "In the next steps we will create new dataframes for all ZCTA, Estimate, and Margin of Error rows. When we break these out into their separate tables they will retain their order. Even though each row will not directly have a ZCTA associated with it, we know that the first row will be ZCTA 00601, the second row ZCTA 00602 and so on based in the index because the rows retained their order. This will then allow us to merge these 3 dataframes back together based on their index giving us a single row for each ZCTA with the Estimate 'EST' and Margin of Error 'MOE' for each year bucket." ] }, { "cell_type": "code", "execution_count": 14, "id": "090674f0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 00601\n", "1 00602\n", "2 00603\n", "3 00606\n", "4 00610\n", " ... 
\n", "33115 99923\n", "33116 99925\n", "33117 99926\n", "33118 99927\n", "33119 99929\n", "Name: ZCTA5, Length: 33120, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter for only rows that start with 'ZCTA' and create a new dataframe. Retain the order and reindex the rows.\n", "# Drop all columns except the first column because they contain no data\n", "# Remove the 'ZCTA5 ' from each row so we are left with just the ZCTA numbers\n", "ZCTA = df2[df2['ZCTA5'].str[:4]=='ZCTA'].copy().reset_index(drop=True)\n", "ZCTA.drop(ZCTA.columns[[1,2,3,4]], axis=1, inplace=True)\n", "ZCTA = ZCTA['ZCTA5'].str.split(' ').str[1]\n", "\n", "ZCTA" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b123ba8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST
0-82,30081,70095,100
1-144,600116,100112,200
2260,300134,400130,400124,100
3-58,20097,00092,900
4-115,900110,500112,000
...............
33115----
33116-263,900339,200181,300
33117-170,300243,800154,400
33118----
33119-196,900178,800231,900
\n", "

33120 rows × 4 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST 2000_To_2009_EST\n", "0 - 82,300 81,700 95,100\n", "1 - 144,600 116,100 112,200\n", "2 260,300 134,400 130,400 124,100\n", "3 - 58,200 97,000 92,900\n", "4 - 115,900 110,500 112,000\n", "... ... ... ... ...\n", "33115 - - - -\n", "33116 - 263,900 339,200 181,300\n", "33117 - 170,300 243,800 154,400\n", "33118 - - - -\n", "33119 - 196,900 178,800 231,900\n", "\n", "[33120 rows x 4 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter the original dataframe for only rows that start with 'Estimate' and create a new dataframe. Retain the order and reindex the rows\n", "# Rename the columns identifying them as EST columns for each year bucket\n", "# Drop the first column because it would just say Estimate 33,000 times\n", "EST = df2[df2['ZCTA5'].str.contains('Estimate', regex=False)].copy().reset_index(drop=True)\n", "EST.columns = ['ZCTA5', '2019_Or_Later_EST', '2015_To_2018_EST', '2010_To_2014_EST', '2000_To_2009_EST']\n", "EST.drop(EST.columns[[0]], axis=1, inplace=True)\n", "\n", "EST" ] }, { "cell_type": "code", "execution_count": 16, "id": "50b85ded", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
0**10,86329,87620,917
1**45,08815,51220,066
231,08357,76113,05812,732
3**25,88137,19514,438
4**26,64715,10314,945
...............
33115********
33116**31,17533,70619,820
33117**20,982162,40531,403
33118********
33119**52,06130,05193,901
\n", "

33120 rows × 4 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_MOE 2015_To_2018_MOE 2010_To_2014_MOE 2000_To_2009_MOE\n", "0 ** 10,863 29,876 20,917\n", "1 ** 45,088 15,512 20,066\n", "2 31,083 57,761 13,058 12,732\n", "3 ** 25,881 37,195 14,438\n", "4 ** 26,647 15,103 14,945\n", "... ... ... ... ...\n", "33115 ** ** ** **\n", "33116 ** 31,175 33,706 19,820\n", "33117 ** 20,982 162,405 31,403\n", "33118 ** ** ** **\n", "33119 ** 52,061 30,051 93,901\n", "\n", "[33120 rows x 4 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same process as above but for MOE\n", "MOE = df2[df2['ZCTA5'].str.contains('Margin', regex=False)].copy().reset_index(drop=True)\n", "MOE.columns = ['ZCTA5', '2019_Or_Later_MOE', '2015_To_2018_MOE', '2010_To_2014_MOE', '2000_To_2009_MOE']\n", "MOE.drop(MOE.columns[[0]], axis=1, inplace=True)\n", "\n", "MOE" ] }, { "cell_type": "markdown", "id": "1e0b896e", "metadata": {}, "source": [ "### Rejoin the data\n", "Here we will first rejoin the EST and MOE tables to clean them before joining back to the ZCTAs. Since we want to keep ZCTA in 'object' or text format, not having it with the other data will allow us to clean and put the rest of the data that we do want in a numeric format easily." ] }, { "cell_type": "code", "execution_count": 40, "id": "8c0ae3d8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
2019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
0NaN82300.081700.095100.0NaN10863.029876.020917.0
1NaN144600.0116100.0112200.0NaN45088.015512.020066.0
2260300.0134400.0130400.0124100.031083.057761.013058.012732.0
3NaN58200.097000.092900.0NaN25881.037195.014438.0
4NaN115900.0110500.0112000.0NaN26647.015103.014945.0
...........................
33115NaNNaNNaNNaNNaNNaNNaNNaN
33116NaN263900.0339200.0181300.0NaN31175.033706.019820.0
33117NaN170300.0243800.0154400.0NaN20982.0162405.031403.0
33118NaNNaNNaNNaNNaNNaNNaNNaN
33119NaN196900.0178800.0231900.0NaN52061.030051.093901.0
\n", "

33120 rows × 8 columns

\n", "
" ], "text/plain": [ " 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST \\\n", "0 NaN 82300.0 81700.0 \n", "1 NaN 144600.0 116100.0 \n", "2 260300.0 134400.0 130400.0 \n", "3 NaN 58200.0 97000.0 \n", "4 NaN 115900.0 110500.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 NaN 263900.0 339200.0 \n", "33117 NaN 170300.0 243800.0 \n", "33118 NaN NaN NaN \n", "33119 NaN 196900.0 178800.0 \n", "\n", " 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE \\\n", "0 95100.0 NaN 10863.0 \n", "1 112200.0 NaN 45088.0 \n", "2 124100.0 31083.0 57761.0 \n", "3 92900.0 NaN 25881.0 \n", "4 112000.0 NaN 26647.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 181300.0 NaN 31175.0 \n", "33117 154400.0 NaN 20982.0 \n", "33118 NaN NaN NaN \n", "33119 231900.0 NaN 52061.0 \n", "\n", " 2010_To_2014_MOE 2000_To_2009_MOE \n", "0 29876.0 20917.0 \n", "1 15512.0 20066.0 \n", "2 13058.0 12732.0 \n", "3 37195.0 14438.0 \n", "4 15103.0 14945.0 \n", "... ... ... \n", "33115 NaN NaN \n", "33116 33706.0 19820.0 \n", "33117 162405.0 31403.0 \n", "33118 NaN NaN \n", "33119 30051.0 93901.0 \n", "\n", "[33120 rows x 8 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the EST and MOE tables back together on the index\n", "# Remove commas because it seems to confuse pandas and doesn't allow conversion to numeric\n", "# Remove other special characters and replace with nan\n", "# Convert all data in the dataframe to numeric\n", "Merge = pd.concat([EST, MOE], axis=1)\n", "Merge = Merge.replace(',','', regex=True)\n", "Merge = Merge.replace({'-': np.nan, '**': np.nan, '***': np.nan})\n", "Merge = Merge.apply(pd.to_numeric, errors='coerce')\n", "\n", "Merge\n", "#Merge.dtypes" ] }, { "cell_type": "markdown", "id": "064b22c4", "metadata": {}, "source": [ "```{note} Note that dtypes are float and we want int in the database. 
This is fine since the target table in the database will be int and that is how the data will import\n", "```" ] }, { "cell_type": "code", "execution_count": 44, "id": "3718e327", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA52019_Or_Later_EST2015_To_2018_EST2010_To_2014_EST2000_To_2009_EST2019_Or_Later_MOE2015_To_2018_MOE2010_To_2014_MOE2000_To_2009_MOE
000601NaN82300.081700.095100.0NaN10863.029876.020917.0
100602NaN144600.0116100.0112200.0NaN45088.015512.020066.0
200603260300.0134400.0130400.0124100.031083.057761.013058.012732.0
300606NaN58200.097000.092900.0NaN25881.037195.014438.0
400610NaN115900.0110500.0112000.0NaN26647.015103.014945.0
..............................
3311599923NaNNaNNaNNaNNaNNaNNaNNaN
3311699925NaN263900.0339200.0181300.0NaN31175.033706.019820.0
3311799926NaN170300.0243800.0154400.0NaN20982.0162405.031403.0
3311899927NaNNaNNaNNaNNaNNaNNaNNaN
3311999929NaN196900.0178800.0231900.0NaN52061.030051.093901.0
\n", "

33120 rows × 9 columns

\n", "
" ], "text/plain": [ " ZCTA5 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST \\\n", "0 00601 NaN 82300.0 81700.0 \n", "1 00602 NaN 144600.0 116100.0 \n", "2 00603 260300.0 134400.0 130400.0 \n", "3 00606 NaN 58200.0 97000.0 \n", "4 00610 NaN 115900.0 110500.0 \n", "... ... ... ... ... \n", "33115 99923 NaN NaN NaN \n", "33116 99925 NaN 263900.0 339200.0 \n", "33117 99926 NaN 170300.0 243800.0 \n", "33118 99927 NaN NaN NaN \n", "33119 99929 NaN 196900.0 178800.0 \n", "\n", " 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE \\\n", "0 95100.0 NaN 10863.0 \n", "1 112200.0 NaN 45088.0 \n", "2 124100.0 31083.0 57761.0 \n", "3 92900.0 NaN 25881.0 \n", "4 112000.0 NaN 26647.0 \n", "... ... ... ... \n", "33115 NaN NaN NaN \n", "33116 181300.0 NaN 31175.0 \n", "33117 154400.0 NaN 20982.0 \n", "33118 NaN NaN NaN \n", "33119 231900.0 NaN 52061.0 \n", "\n", " 2010_To_2014_MOE 2000_To_2009_MOE \n", "0 29876.0 20917.0 \n", "1 15512.0 20066.0 \n", "2 13058.0 12732.0 \n", "3 37195.0 14438.0 \n", "4 15103.0 14945.0 \n", "... ... ... \n", "33115 NaN NaN \n", "33116 33706.0 19820.0 \n", "33117 162405.0 31403.0 \n", "33118 NaN NaN \n", "33119 30051.0 93901.0 \n", "\n", "[33120 rows x 9 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the data with its resepctive ZCTA\n", "Merge2 = pd.concat([ZCTA, Merge], axis=1)\n", "\n", "Merge2\n", "#Merge2.dtypes" ] }, { "cell_type": "markdown", "id": "39e446b5", "metadata": {}, "source": [ "```{note} Note that ZCTA5 is dtype object and retained the leading zeros like we wanted\n", "```" ] }, { "cell_type": "markdown", "id": "b4b63136", "metadata": {}, "source": [ "### Break The Buckets Into Start and End Years\n", "Since querying for an individual year is more intuitive than querying a column bucket year range, we will now break out each each bucket to its respective years. First, we will create tables with new columns for beginning and ending years of the buckets for year range. 
Then we will do the equivalent of a SQL UNION to bring them all back together." ] }, { "cell_type": "code", "execution_count": 45, "id": "386d9d13", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
000601NaNNaN20192020
100602NaNNaN20192020
200603260300.031083.020192020
300606NaNNaN20192020
400610NaNNaN20192020
..................
3311599923NaNNaN20192020
3311699925NaNNaN20192020
3311799926NaNNaN20192020
3311899927NaNNaN20192020
3311999929NaNNaN20192020
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 NaN NaN 2019 2020\n", "1 00602 NaN NaN 2019 2020\n", "2 00603 260300.0 31083.0 2019 2020\n", "3 00606 NaN NaN 2019 2020\n", "4 00610 NaN NaN 2019 2020\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2019 2020\n", "33116 99925 NaN NaN 2019 2020\n", "33117 99926 NaN NaN 2019 2020\n", "33118 99927 NaN NaN 2019 2020\n", "33119 99929 NaN NaN 2019 2020\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new table for only ZCTA5 and the 2019_Or_Later buckets\n", "# Create a new column for the first year in the bucket\n", "# Create a new column for the last year in the bucket (in this case we do not have data after 2019 so end in 2020)\n", "# Rename the columns\n", "Trans1 = Merge2.filter(['ZCTA5', '2019_Or_Later_EST', '2019_Or_Later_MOE' ], axis=1)\n", "Trans1['Start_Year'] = 2019\n", "Trans1['End_Year'] = 2020\n", "Trans1.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans1" ] }, { "cell_type": "code", "execution_count": 46, "id": "00f89984", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060182300.010863.020152018
100602144600.045088.020152018
200603134400.057761.020152018
30060658200.025881.020152018
400610115900.026647.020152018
..................
3311599923NaNNaN20152018
3311699925263900.031175.020152018
3311799926170300.020982.020152018
3311899927NaNNaN20152018
3311999929196900.052061.020152018
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 82300.0 10863.0 2015 2018\n", "1 00602 144600.0 45088.0 2015 2018\n", "2 00603 134400.0 57761.0 2015 2018\n", "3 00606 58200.0 25881.0 2015 2018\n", "4 00610 115900.0 26647.0 2015 2018\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2015 2018\n", "33116 99925 263900.0 31175.0 2015 2018\n", "33117 99926 170300.0 20982.0 2015 2018\n", "33118 99927 NaN NaN 2015 2018\n", "33119 99929 196900.0 52061.0 2015 2018\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except 2015 start and 2018 end\n", "# Rename the columns same as above to allow matching for union later\n", "Trans2 = Merge2.filter(['ZCTA5', '2015_To_2018_EST', '2015_To_2018_MOE'])\n", "Trans2['Start_Year'] = 2015\n", "Trans2['End_Year'] = 2018\n", "Trans2.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans2" ] }, { "cell_type": "code", "execution_count": 47, "id": "b65cdcaa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060181700.029876.020102014
100602116100.015512.020102014
200603130400.013058.020102014
30060697000.037195.020102014
400610110500.015103.020102014
..................
3311599923NaNNaN20102014
3311699925339200.033706.020102014
3311799926243800.0162405.020102014
3311899927NaNNaN20102014
3311999929178800.030051.020102014
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 81700.0 29876.0 2010 2014\n", "1 00602 116100.0 15512.0 2010 2014\n", "2 00603 130400.0 13058.0 2010 2014\n", "3 00606 97000.0 37195.0 2010 2014\n", "4 00610 110500.0 15103.0 2010 2014\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2010 2014\n", "33116 99925 339200.0 33706.0 2010 2014\n", "33117 99926 243800.0 162405.0 2010 2014\n", "33118 99927 NaN NaN 2010 2014\n", "33119 99929 178800.0 30051.0 2010 2014\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except for 2010 start and 2014 end\n", "Trans3 = Merge2.filter(['ZCTA5', '2010_To_2014_EST', '2010_To_2014_MOE'])\n", "Trans3['Start_Year'] = 2010\n", "Trans3['End_Year'] = 2014\n", "Trans3.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans3" ] }, { "cell_type": "code", "execution_count": 48, "id": "bd0d82c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060195100.020917.020002009
100602112200.020066.020002009
200603124100.012732.020002009
30060692900.014438.020002009
400610112000.014945.020002009
..................
3311599923NaNNaN20002009
3311699925181300.019820.020002009
3311799926154400.031403.020002009
3311899927NaNNaN20002009
3311999929231900.093901.020002009
\n", "

33120 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 95100.0 20917.0 2000 2009\n", "1 00602 112200.0 20066.0 2000 2009\n", "2 00603 124100.0 12732.0 2000 2009\n", "3 00606 92900.0 14438.0 2000 2009\n", "4 00610 112000.0 14945.0 2000 2009\n", "... ... ... ... ... ...\n", "33115 99923 NaN NaN 2000 2009\n", "33116 99925 181300.0 19820.0 2000 2009\n", "33117 99926 154400.0 31403.0 2000 2009\n", "33118 99927 NaN NaN 2000 2009\n", "33119 99929 231900.0 93901.0 2000 2009\n", "\n", "[33120 rows x 5 columns]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Same as above except for 2000 start and 2009 end\n", "Trans4 = Merge2.filter(['ZCTA5', '2000_To_2009_EST', '2000_To_2009_MOE'])\n", "Trans4['Start_Year'] = 2000\n", "Trans4['End_Year'] = 2009\n", "Trans4.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']\n", "\n", "Trans4" ] }, { "cell_type": "code", "execution_count": 50, "id": "30490dda", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_Year
00060195100.020917.020002009
00060181700.029876.020102014
00060182300.010863.020152018
000601NaNNaN20192020
100602112200.020066.020002009
..................
3311899927NaNNaN20192020
3311999929231900.093901.020002009
3311999929178800.030051.020102014
3311999929196900.052061.020152018
3311999929NaNNaN20192020
\n", "

132480 rows × 5 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year\n", "0 00601 95100.0 20917.0 2000 2009\n", "0 00601 81700.0 29876.0 2010 2014\n", "0 00601 82300.0 10863.0 2015 2018\n", "0 00601 NaN NaN 2019 2020\n", "1 00602 112200.0 20066.0 2000 2009\n", "... ... ... ... ... ...\n", "33118 99927 NaN NaN 2019 2020\n", "33119 99929 231900.0 93901.0 2000 2009\n", "33119 99929 178800.0 30051.0 2010 2014\n", "33119 99929 196900.0 52061.0 2015 2018\n", "33119 99929 NaN NaN 2019 2020\n", "\n", "[132480 rows x 5 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the 4 new dataframes as list 'frames' \n", "# Bring all data back together by concatonating our frames list together. Sort the data by ZCTA and Year\n", "frames = [Trans1, Trans2, Trans3, Trans4]\n", "frames2 = pd.concat(frames).sort_values(['ZCTA5', 'Start_Year'])\n", "\n", "frames2" ] }, { "cell_type": "markdown", "id": "cb878af7", "metadata": {}, "source": [ "### Break The Buckets To Individual Years\n", "Now we could do a SQL query for Year Between Star_Year and End_Year, but we want to be able to query for a single row and single year to be consistent with some of the other tables in the database. To do this, we will use the lambda function to create a column with a list of all the years in each bucket. Then we use explode to create a row for each Year in that new Year list." ] }, { "cell_type": "code", "execution_count": 52, "id": "347b2179", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORStart_YearEnd_YearYear
00060195100.020917.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
00060181700.029876.020102014[2010, 2011, 2012, 2013, 2014]
00060182300.010863.020152018[2015, 2016, 2017, 2018]
000601NaNNaN20192020[2019, 2020]
100602112200.020066.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
.....................
3311899927NaNNaN20192020[2019, 2020]
3311999929231900.093901.020002009[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
3311999929178800.030051.020102014[2010, 2011, 2012, 2013, 2014]
3311999929196900.052061.020152018[2015, 2016, 2017, 2018]
3311999929NaNNaN20192020[2019, 2020]
\n", "

132480 rows × 6 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year \\\n", "0 00601 95100.0 20917.0 2000 2009 \n", "0 00601 81700.0 29876.0 2010 2014 \n", "0 00601 82300.0 10863.0 2015 2018 \n", "0 00601 NaN NaN 2019 2020 \n", "1 00602 112200.0 20066.0 2000 2009 \n", "... ... ... ... ... ... \n", "33118 99927 NaN NaN 2019 2020 \n", "33119 99929 231900.0 93901.0 2000 2009 \n", "33119 99929 178800.0 30051.0 2010 2014 \n", "33119 99929 196900.0 52061.0 2015 2018 \n", "33119 99929 NaN NaN 2019 2020 \n", "\n", " Year \n", "0 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "0 [2010, 2011, 2012, 2013, 2014] \n", "0 [2015, 2016, 2017, 2018] \n", "0 [2019, 2020] \n", "1 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "... ... \n", "33118 [2019, 2020] \n", "33119 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... \n", "33119 [2010, 2011, 2012, 2013, 2014] \n", "33119 [2015, 2016, 2017, 2018] \n", "33119 [2019, 2020] \n", "\n", "[132480 rows x 6 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create list of all years in the bucket\n", "frames2['Year'] = frames2.apply(lambda x: list(range(x['Start_Year'], x['End_Year']+1)), axis =1)\n", "\n", "frames2" ] }, { "cell_type": "code", "execution_count": 55, "id": "b914585d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5ESTIMATEMARGIN_OF_ERRORYear
00060195100.020917.02000
00060195100.020917.02001
00060195100.020917.02002
00060195100.020917.02003
00060195100.020917.02004
...............
3311999929178800.030051.02014
3311999929196900.052061.02015
3311999929196900.052061.02016
3311999929196900.052061.02017
3311999929196900.052061.02018
\n", "

501841 rows × 4 columns

\n", "
" ], "text/plain": [ " ZCTA5 ESTIMATE MARGIN_OF_ERROR Year\n", "0 00601 95100.0 20917.0 2000\n", "0 00601 95100.0 20917.0 2001\n", "0 00601 95100.0 20917.0 2002\n", "0 00601 95100.0 20917.0 2003\n", "0 00601 95100.0 20917.0 2004\n", "... ... ... ... ...\n", "33119 99929 178800.0 30051.0 2014\n", "33119 99929 196900.0 52061.0 2015\n", "33119 99929 196900.0 52061.0 2016\n", "33119 99929 196900.0 52061.0 2017\n", "33119 99929 196900.0 52061.0 2018\n", "\n", "[501841 rows x 4 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Explode the Year list to new columns and drop the now unnecessary start and end year columns\n", "# Drop any rows where there is no estimate. If there is no data, there is no reason for it to be in the databse.\n", "final = frames2.explode('Year').drop(columns=['Start_Year','End_Year'])\n", "final = final.dropna(subset = ['ESTIMATE'])\n", "\n", "final" ] }, { "cell_type": "markdown", "id": "58e0e9bd", "metadata": {}, "source": [ "### Next Steps - Zip to ZCTA\n", "Now that we have our data in a format that matches some of the other tables in the database, we need to be able to match this ZCTA level table to Zip Code level tables. To do this, we will use the Zip to ZCTA Crosswalk provided from the link below. This will allow us to write a Group By statement in a SQL query that will convert Zip level data to ZCTA level data that can then be joined to our ZCTA level data. This table also has the added benefit of including the City and State name for each Zip.\n", "\n", "https://udsmapper.org/zip-code-to-zcta-crosswalk/" ] }, { "cell_type": "code", "execution_count": 3, "id": "c2715ccc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZIP_CODEPO_NAMESTATEZIP_TYPEZCTAzip_join_type
000501HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
100544HoltsvilleNYPost Office or large volume customer11742Spatial join to ZCTA
200601AdjuntasPRZip Code Area00601Zip matches ZCTA
300602AguadaPRZip Code Area00602Zip matches ZCTA
400603AguadillaPRZip Code Area00603Zip matches ZCTA
.....................
4108699926MetlakatlaAKZip Code Area99926Zip matches ZCTA
4108799927Point BakerAKZip Code Area99927Zip matches ZCTA
4108899928Ward CoveAKPost Office or large volume customer99901Spatial join to ZCTA
4108999929WrangellAKZip Code Area99929Zip matches ZCTA
4109099950KetchikanAKZip Code Area99921Spatial join to ZCTA
\n", "

41091 rows × 6 columns

\n", "
" ], "text/plain": [ " ZIP_CODE PO_NAME STATE ZIP_TYPE \\\n", "0 00501 Holtsville NY Post Office or large volume customer \n", "1 00544 Holtsville NY Post Office or large volume customer \n", "2 00601 Adjuntas PR Zip Code Area \n", "3 00602 Aguada PR Zip Code Area \n", "4 00603 Aguadilla PR Zip Code Area \n", "... ... ... ... ... \n", "41086 99926 Metlakatla AK Zip Code Area \n", "41087 99927 Point Baker AK Zip Code Area \n", "41088 99928 Ward Cove AK Post Office or large volume customer \n", "41089 99929 Wrangell AK Zip Code Area \n", "41090 99950 Ketchikan AK Zip Code Area \n", "\n", " ZCTA zip_join_type \n", "0 11742 Spatial join to ZCTA \n", "1 11742 Spatial join to ZCTA \n", "2 00601 Zip matches ZCTA \n", "3 00602 Zip matches ZCTA \n", "4 00603 Zip matches ZCTA \n", "... ... ... \n", "41086 99926 Zip matches ZCTA \n", "41087 99927 Zip matches ZCTA \n", "41088 99901 Spatial join to ZCTA \n", "41089 99929 Zip matches ZCTA \n", "41090 99921 Spatial join to ZCTA \n", "\n", "[41091 rows x 6 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Specify each column as str on import or it will drop the leading zeros that we need.\n", "df = pd.read_excel(\"contrib_data/ZIPCodetoZCTACrosswalk2021UDS.xlsx\", dtype = {'ZIP_CODE': str, 'PO_NAME': str, 'STATE': str, 'ZIP_TYPE': str, 'ZCTA': str, 'zip_join_type': str})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "2973105e", "metadata": {}, "source": [ "### Ready to Import to Database\n", "Both of the tables are now ready to import into the database. Refer to the P4Analytics page on different methods to connect and import data.\n", "### Get Querying\n", "The following query is an example of how we might use some of the data we have. Lets say we wanted to look at the average income and the median home sale price in all Little Rock AR Zips for 2018. 
Our income data is at the Zip Code level, but the median housing data is at the ZCTA level. Because the two tables are at different levels, we will first need to group the Zip-Code-level income data into ZCTAs. We can then use this income-by-ZCTA table as a subquery and join it to the ZCTA-level median housing data. Because a ZCTA takes the number of its most frequent Zip, we can also join back to the Zip to ZCTA conversion table to get the City and State of that Zip/ZCTA and filter the results." ] }, { "cell_type": "code", "execution_count": null, "id": "4d2c860d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pypyodbc as pyodbc\n", "# Change the values below to your personal and server credentials\n", "\n", "svr = 'the_server'\n", "uid = 'your_user_id'\n", "pwd = 'your_password'\n", "db = 'the_database'\n", "\n", "cnxn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER='+svr+';DATABASE='+db+';UID='+uid+';PWD='+pwd+';TrustServerCertificate=yes;')\n", "\n", "query = \"select d.PO_NAME as City \\\n", ", d.STATE \\\n", ", INCOME.* \\\n", ", c.ESTIMATE as MEDIAN_HOUSING_ESTIMATE \\\n", ", c.ESTIMATE - c.MARGIN_OF_ERROR as MEDIAN_HOUSE_LB \\\n", ", c.ESTIMATE + c.MARGIN_OF_ERROR as MEDIAN_HOUSE_UB \\\n", "from (select b.ZCTA \\\n", ", sum(a.Number_Of_Returns) as NBR_RTRNS \\\n", ", floor(avg(a.Total_Income_Amount)) as AVG_INCOME \\\n", ", a.Date_Year \\\n", "from IRS_ZIPCODE_SOURCE a \\\n", "join ZIP_TO_ZCTA b \\\n", "on a.Zipcode = b.ZIP_CODE \\\n", "group by b.ZCTA, a.Date_Year) as INCOME \\\n", "join MEDIAN_HOUSE_VALUE c \\\n", "on INCOME.ZCTA = c.ZCTA5 \\\n", "and INCOME.Date_Year = c.Year \\\n", "join ZIP_TO_ZCTA d \\\n", "on c.ZCTA5 = d.ZIP_CODE \\\n", "where d.STATE = 'AR' \\\n", "and d.PO_NAME = 'Little Rock' \\\n", "and INCOME.Date_Year = 2018\"\n", "\n", "df = pd.read_sql(query, cnxn)\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('base')", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" } } }, "nbformat": 4, "nbformat_minor": 5 }