US Census Bureau Median Housing Data#
The US Census Bureau publicly provides an extensive amount of the information it collects. Sifting through it to find the data you’re looking for is difficult enough; getting that data into a usable format that can then be uploaded to an RDBMS is something else entirely. This guide walks through how to retrieve the data, how to clean and reshape it in Python, and how to export it to an Excel or CSV file that can be uploaded to an RDBMS. This is the same process used to create the MEDIAN_HOUSE_VALUE table on the USZIPCODE database for the University.
ZCTA vs. Zip Code#
For this example we will use the Census Bureau-derived ZCTAs rather than actual Zip Codes. You can read more about ZCTAs at the link below. We will also create a table that lets us convert and group Zip Codes into ZCTAs so we can compare data in either format.
Census Bureau ZCTAs
TL;DR
ZCTAs combine Zip Codes that have very few addresses and assign the most frequent Zip Code within each grouping as the ZCTA.
Obtaining The Data#
1.) Navigate to https://data.census.gov/cedsci/table. Here you can search all of the different information the US Census Bureau publicly provides. For this example, search for “B25109” or “Median Value by Year Householder Moved Into Unit” to use the same dataset, or pick another table of interest that allows you to break the data out by ZCTA5 code in the next step.
2.) Click the “Geos” button in the table header and navigate to “Zip Code Tabulation Area” (we will refer to this as ZCTA from here on). If you want to work with a smaller subset of data first, select a single state and then All ZCTAs for that state; or select the top button for all ZCTAs in the US. This will likely take some time since the table is roughly 100,000 rows. Hit the “Transpose” button to format the data so that ZCTA, Estimate, and Margin of Error are the rows and the year ranges are the columns.
Note
As stated in the first step, not all tables allow you to break the data out by ZCTA. While some of these steps can be applied to other geographies, this guide is explicitly for cleaning ZCTA data.
3.) The first column now gives us a median of the medians and should be removed, since we can easily run that calculation later if we want it. I also removed the final two columns, which hold data from before 2000, but these can be kept if you are interested in historical data going that far back. To remove these columns, click "Columns" on the far right side of the website and de-select the top two untitled buttons and the final two buttons for pre-2000 data.
4.) Press the CSV button on the top header to export your dataset to a CSV file and wait for it to download. This will likely take some time since it is almost 100,000 rows.
Importing The Data#
We’re finally ready to start preparing our data! It is a good idea to move the downloaded file into the same folder where the Jupyter Notebook or Python program will be saved. I also renamed the file so it is easy to identify and easy to read into Jupyter.
Execute the following cell, looking at both the head and the tail.
import pandas as pd
import numpy as np
df = pd.read_csv("contrib_data/US_Housing_by_zip.csv")
df.head(10)
#df.tail(10)
Label (Grouping) | Median value --!!Total:!!Moved in 2019 or later | Median value --!!Total:!!Moved in 2015 to 2018 | Median value --!!Total:!!Moved in 2010 to 2014 | Median value --!!Total:!!Moved in 2000 to 2009 | |
---|---|---|---|---|---|
0 | ZCTA5 00601 | NaN | NaN | NaN | NaN |
1 | Estimate | - | 82,300 | 81,700 | 95,100 |
2 | Margin of Error | ** | ±10,863 | ±29,876 | ±20,917 |
3 | ZCTA5 00602 | NaN | NaN | NaN | NaN |
4 | Estimate | - | 144,600 | 116,100 | 112,200 |
5 | Margin of Error | ** | ±45,088 | ±15,512 | ±20,066 |
6 | ZCTA5 00603 | NaN | NaN | NaN | NaN |
7 | Estimate | 260,300 | 134,400 | 130,400 | 124,100 |
8 | Margin of Error | ±31,083 | ±57,761 | ±13,058 | ±12,732 |
9 | ZCTA5 00606 | NaN | NaN | NaN | NaN |
Cleaning the Data#
To put it simply, this data is ugly. In its current format it is essentially unqueryable. It does follow a recognizable pattern, though:
Every 3 rows are ZCTA #, Estimate, and Margin of Error, with the Estimate and Margin of Error relating to the ZCTA above them.
The ZCTA row always starts with ‘ZCTA5’ and has no values.
Margin of Error always starts with ± when it has a value.
One other thing to note: the final 3 rows are an aggregate of all United States data. Since there is no single ZCTA associated with these rows and we can easily run this aggregate with a query when we are finished, we will drop these last 3 rows.
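Before relying on that 3-row pattern, it can be sanity-checked programmatically. The following is a minimal sketch on toy rows shaped like the output above, not the full census frame:

```python
import pandas as pd

# Toy frame mimicking the repeating 3-row pattern described above
# (ZCTA5 header row, then Estimate, then Margin of Error).
df = pd.DataFrame({'ZCTA5': ['ZCTA5 00601', 'Estimate', 'Margin of Error',
                             'ZCTA5 00602', 'Estimate', 'Margin of Error']})

# Every 3rd row starting at index 0 should be a ZCTA header row,
# and the rows at offsets 1 and 2 should be Estimate / Margin of Error.
assert df['ZCTA5'].iloc[::3].str.startswith('ZCTA5').all()
assert (df['ZCTA5'].iloc[1::3] == 'Estimate').all()
assert (df['ZCTA5'].iloc[2::3] == 'Margin of Error').all()
print('Pattern holds')
```

Running the same three assertions against the real frame (before dropping the last 3 aggregate rows) is a cheap way to catch an export that came down in a different shape.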
To get this data into a usable format we will:
Rename the columns for easier management and querying
This isn’t strictly necessary, but, much like SQL aliases, it makes several later steps easier
Remove all ± and other special characters
This is an easy cleaning step that simplifies the numeric conversion later
Drop the last 3 rows for the reasons stated above
Match each ‘Estimate’ and ‘Margin of Error’ with the appropriate ZCTA
This will get the data for each ZCTA into a single row
Ensure each column is the proper data type
Zip Codes and ZCTAs can have leading zeros. We need to ensure they import with these zeros for joining
Break out the year buckets into individual years
This will make querying more intuitive and allow for easier appending when new data becomes available
Drop any rows in the final dataframe that have no values
These steps will produce a final dataframe that is easy to query, easy to manage, and matches the schema of the other tables in the database. We will take the data above and transform it into the format below.
ZCTA | Estimate | Margin_of_Error | Year |
---|---|---|---|
00601 | 82300 | 10863 | 2015 |
00601 | 81700 | 29876 | 2014 |
Let’s get started. Execute the following cell.
# Rename the columns
# Remove the leading ± from the Margin of Error values
# Drop the last 3 rows (the US-level aggregate)
df2 = df.copy()
df2.columns = ['ZCTA5', '2019_Or_Later', '2015_To_2018', '2010_To_2014', '2000_To_2009']
df2 = df2.replace(r'^±', '', regex=True)
df2.drop(df2.tail(3).index, inplace=True)
df2
ZCTA5 | 2019_Or_Later | 2015_To_2018 | 2010_To_2014 | 2000_To_2009 | |
---|---|---|---|---|---|
0 | ZCTA5 00601 | NaN | NaN | NaN | NaN |
1 | Estimate | - | 82,300 | 81,700 | 95,100 |
2 | Margin of Error | ** | 10,863 | 29,876 | 20,917 |
3 | ZCTA5 00602 | NaN | NaN | NaN | NaN |
4 | Estimate | - | 144,600 | 116,100 | 112,200 |
... | ... | ... | ... | ... | ... |
99355 | Estimate | - | - | - | - |
99356 | Margin of Error | ** | ** | ** | ** |
99357 | ZCTA5 99929 | NaN | NaN | NaN | NaN |
99358 | Estimate | - | 196,900 | 178,800 | 231,900 |
99359 | Margin of Error | ** | 52,061 | 30,051 | 93,901 |
99360 rows × 5 columns
Break the data apart#
In the next steps we will create new dataframes for the ZCTA, Estimate, and Margin of Error rows. When we break these out into separate tables they retain their order. Even though each row will not directly have a ZCTA associated with it, we know that the first row belongs to ZCTA 00601, the second row to ZCTA 00602, and so on, based on the index, because the rows retained their order. This will allow us to merge the 3 dataframes back together on their index, giving a single row per ZCTA with the Estimate (‘EST’) and Margin of Error (‘MOE’) for each year bucket.
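As a minimal toy illustration of this index alignment before applying it to the real frame (values taken from the sample output above):

```python
import pandas as pd

# Toy stand-ins for the three re-indexed frames built below,
# assuming each was reset with reset_index(drop=True).
zcta = pd.Series(['00601', '00602'], name='ZCTA5')
est = pd.DataFrame({'2015_To_2018_EST': [82300, 144600]})
moe = pd.DataFrame({'2015_To_2018_MOE': [10863, 45088]})

# concat(axis=1) aligns on the shared 0..n index, giving one row per ZCTA.
merged = pd.concat([zcta, est, moe], axis=1)
print(merged)
```

This only works because all three frames were re-indexed to the same 0-based index; if any frame kept its original row labels, `concat` would misalign the rows.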
# Filter for only rows that start with 'ZCTA' and create a new dataframe. Retain the order and reindex the rows.
# Drop all columns except the first column because they contain no data
# Remove the 'ZCTA5 ' from each row so we are left with just the ZCTA numbers
ZCTA = df2[df2['ZCTA5'].str[:4]=='ZCTA'].copy().reset_index(drop=True)
ZCTA.drop(ZCTA.columns[[1,2,3,4]], axis=1, inplace=True)
ZCTA = ZCTA['ZCTA5'].str.split(' ').str[1]
ZCTA
0 00601
1 00602
2 00603
3 00606
4 00610
...
33115 99923
33116 99925
33117 99926
33118 99927
33119 99929
Name: ZCTA5, Length: 33120, dtype: object
# Filter the original dataframe for only rows that start with 'Estimate' and create a new dataframe. Retain the order and reindex the rows
# Rename the columns identifying them as EST columns for each year bucket
# Drop the first column because it would just say Estimate 33,000 times
EST = df2[df2['ZCTA5'].str.contains('Estimate', regex=False)].copy().reset_index(drop=True)
EST.columns = ['ZCTA5', '2019_Or_Later_EST', '2015_To_2018_EST', '2010_To_2014_EST', '2000_To_2009_EST']
EST.drop(EST.columns[[0]], axis=1, inplace=True)
EST
2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | |
---|---|---|---|---|
0 | - | 82,300 | 81,700 | 95,100 |
1 | - | 144,600 | 116,100 | 112,200 |
2 | 260,300 | 134,400 | 130,400 | 124,100 |
3 | - | 58,200 | 97,000 | 92,900 |
4 | - | 115,900 | 110,500 | 112,000 |
... | ... | ... | ... | ... |
33115 | - | - | - | - |
33116 | - | 263,900 | 339,200 | 181,300 |
33117 | - | 170,300 | 243,800 | 154,400 |
33118 | - | - | - | - |
33119 | - | 196,900 | 178,800 | 231,900 |
33120 rows × 4 columns
# Same process as above but for MOE
MOE = df2[df2['ZCTA5'].str.contains('Margin', regex=False)].copy().reset_index(drop=True)
MOE.columns = ['ZCTA5', '2019_Or_Later_MOE', '2015_To_2018_MOE', '2010_To_2014_MOE', '2000_To_2009_MOE']
MOE.drop(MOE.columns[[0]], axis=1, inplace=True)
MOE
2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|
0 | ** | 10,863 | 29,876 | 20,917 |
1 | ** | 45,088 | 15,512 | 20,066 |
2 | 31,083 | 57,761 | 13,058 | 12,732 |
3 | ** | 25,881 | 37,195 | 14,438 |
4 | ** | 26,647 | 15,103 | 14,945 |
... | ... | ... | ... | ... |
33115 | ** | ** | ** | ** |
33116 | ** | 31,175 | 33,706 | 19,820 |
33117 | ** | 20,982 | 162,405 | 31,403 |
33118 | ** | ** | ** | ** |
33119 | ** | 52,061 | 30,051 | 93,901 |
33120 rows × 4 columns
Rejoin the data#
Here we will first rejoin the EST and MOE tables and clean them before joining back to the ZCTAs. Since we want to keep ZCTA in ‘object’ (text) format, keeping it apart from the other data lets us easily convert everything else to a numeric format.
# Join the EST and MOE tables back together on the index
# Remove the thousands-separator commas so the values can be converted to numeric
# Replace the remaining placeholder characters ('-', '**', '***') with NaN
# Convert all data in the dataframe to numeric
Merge = pd.concat([EST, MOE], axis=1)
Merge = Merge.replace(',','', regex=True)
Merge = Merge.replace({'-': np.nan, '**': np.nan, '***': np.nan})
Merge = Merge.apply(pd.to_numeric, errors='coerce')
Merge
#Merge.dtypes
2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | 2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|---|---|---|---|
0 | NaN | 82300.0 | 81700.0 | 95100.0 | NaN | 10863.0 | 29876.0 | 20917.0 |
1 | NaN | 144600.0 | 116100.0 | 112200.0 | NaN | 45088.0 | 15512.0 | 20066.0 |
2 | 260300.0 | 134400.0 | 130400.0 | 124100.0 | 31083.0 | 57761.0 | 13058.0 | 12732.0 |
3 | NaN | 58200.0 | 97000.0 | 92900.0 | NaN | 25881.0 | 37195.0 | 14438.0 |
4 | NaN | 115900.0 | 110500.0 | 112000.0 | NaN | 26647.0 | 15103.0 | 14945.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
33115 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33116 | NaN | 263900.0 | 339200.0 | 181300.0 | NaN | 31175.0 | 33706.0 | 19820.0 |
33117 | NaN | 170300.0 | 243800.0 | 154400.0 | NaN | 20982.0 | 162405.0 | 31403.0 |
33118 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33119 | NaN | 196900.0 | 178800.0 | 231900.0 | NaN | 52061.0 | 30051.0 | 93901.0 |
33120 rows × 8 columns
Note
Note that the dtypes are float while we want int in the database. This is because pandas stores numeric columns containing NaN as float. It is fine here, since the target columns in the database are int and that is how the data will import.
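A minimal sketch of this float/NaN behavior, and of pandas’ nullable `Int64` dtype as an alternative if you prefer whole numbers inside pandas as well (optional here, since the database handles the conversion):

```python
import pandas as pd
import numpy as np

# A NaN in an otherwise-integer column forces the column to float64.
s = pd.Series([82300, np.nan])
print(s.dtype)       # float64

# pandas' nullable integer dtype keeps whole numbers and represents
# the missing value as <NA> instead.
s_int = s.astype('Int64')
print(s_int.dtype)   # Int64
```

`astype('Int64')` works here because every non-missing value is integral; it raises if a value has a fractional part.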
# Join the data with its respective ZCTA
Merge2 = pd.concat([ZCTA, Merge], axis=1)
Merge2
#Merge2.dtypes
ZCTA5 | 2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | 2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|---|---|---|---|---|
0 | 00601 | NaN | 82300.0 | 81700.0 | 95100.0 | NaN | 10863.0 | 29876.0 | 20917.0 |
1 | 00602 | NaN | 144600.0 | 116100.0 | 112200.0 | NaN | 45088.0 | 15512.0 | 20066.0 |
2 | 00603 | 260300.0 | 134400.0 | 130400.0 | 124100.0 | 31083.0 | 57761.0 | 13058.0 | 12732.0 |
3 | 00606 | NaN | 58200.0 | 97000.0 | 92900.0 | NaN | 25881.0 | 37195.0 | 14438.0 |
4 | 00610 | NaN | 115900.0 | 110500.0 | 112000.0 | NaN | 26647.0 | 15103.0 | 14945.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33116 | 99925 | NaN | 263900.0 | 339200.0 | 181300.0 | NaN | 31175.0 | 33706.0 | 19820.0 |
33117 | 99926 | NaN | 170300.0 | 243800.0 | 154400.0 | NaN | 20982.0 | 162405.0 | 31403.0 |
33118 | 99927 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33119 | 99929 | NaN | 196900.0 | 178800.0 | 231900.0 | NaN | 52061.0 | 30051.0 | 93901.0 |
33120 rows × 9 columns
Note
Note that ZCTA5 is dtype object and retained the leading zeros like we wanted
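The leading zeros survived because the ZCTA column was built by splitting strings. If the codes had ever been read as integers (a common mistake with `read_csv`), `zfill` could restore them; a small sketch:

```python
import pandas as pd

# ZCTA codes that were read as integers, dropping their leading zeros.
z = pd.Series([601, 602, 99929])

# Convert back to 5-character strings so they join correctly.
z5 = z.astype(str).str.zfill(5)
print(z5.tolist())  # ['00601', '00602', '99929']
```

The safer option is to avoid the problem entirely by passing `dtype=str` for the code columns at import time, as done for the crosswalk file later in this guide.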
Break The Buckets Into Start and End Years#
Since querying for an individual year is more intuitive than querying a bucketed year-range column, we will now break each bucket out into its respective years. First, we will create tables with new columns for the beginning and ending year of each bucket. Then we will do the equivalent of a SQL UNION to bring them all back together.
# Create a new table for only ZCTA5 and the 2019_Or_Later buckets
# Create a new column for the first year in the bucket
# Create a new column for the last year in the bucket (in this case we do not have data after 2019 so end in 2020)
# Rename the columns
Trans1 = Merge2.filter(['ZCTA5', '2019_Or_Later_EST', '2019_Or_Later_MOE' ], axis=1)
Trans1['Start_Year'] = 2019
Trans1['End_Year'] = 2020
Trans1.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans1
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | NaN | NaN | 2019 | 2020 |
1 | 00602 | NaN | NaN | 2019 | 2020 |
2 | 00603 | 260300.0 | 31083.0 | 2019 | 2020 |
3 | 00606 | NaN | NaN | 2019 | 2020 |
4 | 00610 | NaN | NaN | 2019 | 2020 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2019 | 2020 |
33116 | 99925 | NaN | NaN | 2019 | 2020 |
33117 | 99926 | NaN | NaN | 2019 | 2020 |
33118 | 99927 | NaN | NaN | 2019 | 2020 |
33119 | 99929 | NaN | NaN | 2019 | 2020 |
33120 rows × 5 columns
# Same as above except 2015 start and 2018 end
# Rename the columns same as above to allow matching for union later
Trans2 = Merge2.filter(['ZCTA5', '2015_To_2018_EST', '2015_To_2018_MOE'])
Trans2['Start_Year'] = 2015
Trans2['End_Year'] = 2018
Trans2.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 |
1 | 00602 | 144600.0 | 45088.0 | 2015 | 2018 |
2 | 00603 | 134400.0 | 57761.0 | 2015 | 2018 |
3 | 00606 | 58200.0 | 25881.0 | 2015 | 2018 |
4 | 00610 | 115900.0 | 26647.0 | 2015 | 2018 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2015 | 2018 |
33116 | 99925 | 263900.0 | 31175.0 | 2015 | 2018 |
33117 | 99926 | 170300.0 | 20982.0 | 2015 | 2018 |
33118 | 99927 | NaN | NaN | 2015 | 2018 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 |
33120 rows × 5 columns
# Same as above except for 2010 start and 2014 end
Trans3 = Merge2.filter(['ZCTA5', '2010_To_2014_EST', '2010_To_2014_MOE'])
Trans3['Start_Year'] = 2010
Trans3['End_Year'] = 2014
Trans3.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans3
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 |
1 | 00602 | 116100.0 | 15512.0 | 2010 | 2014 |
2 | 00603 | 130400.0 | 13058.0 | 2010 | 2014 |
3 | 00606 | 97000.0 | 37195.0 | 2010 | 2014 |
4 | 00610 | 110500.0 | 15103.0 | 2010 | 2014 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2010 | 2014 |
33116 | 99925 | 339200.0 | 33706.0 | 2010 | 2014 |
33117 | 99926 | 243800.0 | 162405.0 | 2010 | 2014 |
33118 | 99927 | NaN | NaN | 2010 | 2014 |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 |
33120 rows × 5 columns
# Same as above except for 2000 start and 2009 end
Trans4 = Merge2.filter(['ZCTA5', '2000_To_2009_EST', '2000_To_2009_MOE'])
Trans4['Start_Year'] = 2000
Trans4['End_Year'] = 2009
Trans4.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans4
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 |
2 | 00603 | 124100.0 | 12732.0 | 2000 | 2009 |
3 | 00606 | 92900.0 | 14438.0 | 2000 | 2009 |
4 | 00610 | 112000.0 | 14945.0 | 2000 | 2009 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2000 | 2009 |
33116 | 99925 | 181300.0 | 19820.0 | 2000 | 2009 |
33117 | 99926 | 154400.0 | 31403.0 | 2000 | 2009 |
33118 | 99927 | NaN | NaN | 2000 | 2009 |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 |
33120 rows × 5 columns
# Define the 4 new dataframes in a list called 'frames'
# Bring all the data back together by concatenating the frames. Sort the data by ZCTA and year
frames = [Trans1, Trans2, Trans3, Trans4]
frames2 = pd.concat(frames).sort_values(['ZCTA5', 'Start_Year'])
frames2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 |
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 |
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 |
0 | 00601 | NaN | NaN | 2019 | 2020 |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 |
... | ... | ... | ... | ... | ... |
33118 | 99927 | NaN | NaN | 2019 | 2020 |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 |
33119 | 99929 | NaN | NaN | 2019 | 2020 |
132480 rows × 5 columns
Break The Buckets To Individual Years#
Now we could run a SQL query with Year BETWEEN Start_Year AND End_Year, but we want to be able to query a single row for a single year, to be consistent with some of the other tables in the database. To do this, we will use a lambda function to create a column holding a list of all the years in each bucket, then use explode to create a row for each year in that new list.
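A minimal toy illustration of this build-a-list-then-explode pattern before it is applied to the full frame below:

```python
import pandas as pd

# One bucket row fans out to one row per year in the bucket.
df = pd.DataFrame({'ZCTA5': ['00601'], 'Start_Year': [2015], 'End_Year': [2018]})
df['Year'] = df.apply(lambda r: list(range(r['Start_Year'], r['End_Year'] + 1)), axis=1)

# explode() repeats the other column values for each list element.
out = df.explode('Year')
print(out['Year'].tolist())  # [2015, 2016, 2017, 2018]
```

Note that `explode` repeats the original index for every emitted row, which is why the exploded output later in this guide shows the same index value several times.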
# Create list of all years in the bucket
frames2['Year'] = frames2.apply(lambda x: list(range(x['Start_Year'], x['End_Year']+1)), axis =1)
frames2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | Year | |
---|---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 | [2010, 2011, 2012, 2013, 2014] |
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 | [2015, 2016, 2017, 2018] |
0 | 00601 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
... | ... | ... | ... | ... | ... | ... |
33118 | 99927 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 | [2010, 2011, 2012, 2013, 2014] |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 | [2015, 2016, 2017, 2018] |
33119 | 99929 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
132480 rows × 6 columns
# Explode the Year list into one row per year and drop the now unnecessary start and end year columns
# Drop any rows where there is no estimate. If there is no data, there is no reason for it to be in the database.
final = frames2.explode('Year').drop(columns=['Start_Year','End_Year'])
final = final.dropna(subset = ['ESTIMATE'])
final
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Year | |
---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 |
0 | 00601 | 95100.0 | 20917.0 | 2001 |
0 | 00601 | 95100.0 | 20917.0 | 2002 |
0 | 00601 | 95100.0 | 20917.0 | 2003 |
0 | 00601 | 95100.0 | 20917.0 | 2004 |
... | ... | ... | ... | ... |
33119 | 99929 | 178800.0 | 30051.0 | 2014 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 |
33119 | 99929 | 196900.0 | 52061.0 | 2016 |
33119 | 99929 | 196900.0 | 52061.0 | 2017 |
33119 | 99929 | 196900.0 | 52061.0 | 2018 |
501841 rows × 4 columns
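The final frame is now ready to be written out as the CSV (or Excel) file promised at the start of this guide. A hedged sketch, using a one-row stand-in for `final` so the snippet runs on its own (in the notebook you would call `to_csv` directly on `final`; the filename is illustrative):

```python
import pandas as pd

# One-row stand-in for the cleaned `final` frame above.
final = pd.DataFrame({'ZCTA5': ['00601'], 'ESTIMATE': [95100],
                      'MARGIN_OF_ERROR': [20917], 'Year': [2000]})

# index=False keeps the pandas row index out of the file, so the CSV
# columns line up with the target table's schema.
final.to_csv("median_house_value_by_zcta.csv", index=False)

# When reading the file back, dtype=str preserves the leading zeros in ZCTA5.
check = pd.read_csv("median_house_value_by_zcta.csv", dtype={'ZCTA5': str})
print(check['ZCTA5'].iloc[0])  # 00601
```

`final.to_excel(...)` works the same way if your import tool prefers Excel (it requires the openpyxl package).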
Next Steps - Zip to ZCTA#
Now that we have our data in a format that matches some of the other tables in the database, we need a way to match this ZCTA-level table to Zip Code level tables. To do this, we will use the Zip-to-ZCTA Crosswalk provided at the link below. It allows us to write a GROUP BY in a SQL query that converts Zip-level data to ZCTA-level data, which can then be joined to our ZCTA-level table. This table also has the added benefit of including the City and State name for each Zip.
https://udsmapper.org/zip-code-to-zcta-crosswalk/
import pandas as pd
import numpy as np
# Specify each column as str on import or it will drop the leading zeros that we need.
df = pd.read_excel("contrib_data/ZIPCodetoZCTACrosswalk2021UDS.xlsx", dtype = {'ZIP_CODE': str, 'PO_NAME': str, 'STATE': str, 'ZIP_TYPE': str, 'ZCTA': str, 'zip_join_type': str})
df
ZIP_CODE | PO_NAME | STATE | ZIP_TYPE | ZCTA | zip_join_type | |
---|---|---|---|---|---|---|
0 | 00501 | Holtsville | NY | Post Office or large volume customer | 11742 | Spatial join to ZCTA |
1 | 00544 | Holtsville | NY | Post Office or large volume customer | 11742 | Spatial join to ZCTA |
2 | 00601 | Adjuntas | PR | Zip Code Area | 00601 | Zip matches ZCTA |
3 | 00602 | Aguada | PR | Zip Code Area | 00602 | Zip matches ZCTA |
4 | 00603 | Aguadilla | PR | Zip Code Area | 00603 | Zip matches ZCTA |
... | ... | ... | ... | ... | ... | ... |
41086 | 99926 | Metlakatla | AK | Zip Code Area | 99926 | Zip matches ZCTA |
41087 | 99927 | Point Baker | AK | Zip Code Area | 99927 | Zip matches ZCTA |
41088 | 99928 | Ward Cove | AK | Post Office or large volume customer | 99901 | Spatial join to ZCTA |
41089 | 99929 | Wrangell | AK | Zip Code Area | 99929 | Zip matches ZCTA |
41090 | 99950 | Ketchikan | AK | Zip Code Area | 99921 | Spatial join to ZCTA |
41091 rows × 6 columns
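The Zip-to-ZCTA grouping described above can also be sketched in pandas, which is handy for checking the logic before writing the SQL. The income figures below are hypothetical toy values, not real IRS data:

```python
import pandas as pd

# Toy Zip-level income data (hypothetical values) and a slice of the
# crosswalk columns shown above.
income = pd.DataFrame({'ZIP_CODE': ['00501', '00544', '00601'],
                       'INCOME': [60000, 62000, 25000]})
crosswalk = pd.DataFrame({'ZIP_CODE': ['00501', '00544', '00601'],
                          'ZCTA': ['11742', '11742', '00601']})

# Join each Zip to its ZCTA, then aggregate: the pandas analogue of the
# SQL subquery used at the end of this guide.
zcta_income = (income.merge(crosswalk, on='ZIP_CODE')
                     .groupby('ZCTA', as_index=False)['INCOME'].mean())
print(zcta_income)
```

Note how the two Holtsville Zips (00501 and 00544) collapse into the single ZCTA 11742, exactly the behavior the crosswalk table above documents.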
Ready to Import to Database#
Both of the tables are now ready to import into the database. Refer to the P4Analytics page on different methods to connect and import data.
Get Querying#
The following query is an example of how we might use some of this data. Let’s say we want to look at the average income and the median home value in all Little Rock, AR Zips for 2018. Our income data is at the Zip Code level and the median housing data is at the ZCTA level. Because of these two different levels, we first need to group the Zip Code level income data into ZCTAs. We can then use this income-by-ZCTA table as a subquery and join it to the ZCTA median housing data. Because we know the ZCTA is associated with its most frequent Zip, we can also join back to the Zip-to-ZCTA conversion table to get the City and State for that Zip/ZCTA and filter the results.
import pandas as pd
import pypyodbc as pyodbc
#Change the values below to your personal and server credentials
svr = 'the_server'
uid = 'your_user_id'
pwd = 'your_password'
db = 'the_database'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER='+svr+';DATABASE='+db+';UID='+uid+';PWD='+ pwd+';TrustServerCertificate=yes;')
query = "select d.PO_NAME as City \
, d.STATE \
, INCOME.* \
, c.ESTIMATE as MEDIAN_HOUSING_ESTIMATE \
, c.ESTIMATE - c.MARGIN_OF_ERROR as MEDIAN_HOUSE_LB \
, c.ESTIMATE + c.MARGIN_OF_ERROR as MEDIAN_HOUSE_UB \
from(select b.ZCTA \
, sum(a.Number_Of_Returns) as NBR_RTRNS \
, floor(avg(a.Total_Income_Amount)) as AVG_INCOME \
, a.Date_Year \
from IRS_ZIPCODE_SOURCE a \
join ZIP_TO_ZCTA b \
on a.Zipcode = b.ZIP_CODE \
group by b.ZCTA, a.Date_Year) as INCOME \
join MEDIAN_HOUSE_VALUE c \
on INCOME.ZCTA = c.ZCTA5 \
and INCOME.Date_Year = c.Year \
join ZIP_TO_ZCTA d \
on c.ZCTA5 = d.ZIP_CODE \
where d.STATE = 'AR' \
and d.PO_NAME = 'Little Rock' \
and INCOME.Date_Year = 2018"
df = pd.read_sql(query, cnxn)
df