US Census Bureau Median Housing Data#
The US Census Bureau publicly provides an extensive amount of the information it collects. Sifting through it to find the data you’re looking for is difficult enough; getting that data into a usable format that can then be uploaded to an RDBMS is something else entirely. This guide walks through how to retrieve the data, how to clean and reshape it in Python, and how to export it to an Excel or CSV file that can be uploaded to an RDBMS. This is the same process used to create the MEDIAN_HOUSE_VALUE table on the USZIPCODE database for the University.
ZCTA vs. Zip Code#
For this example we will use the Census Bureau-derived ZCTAs rather than actual Zip Codes. You can read more about ZCTAs at the link below. We will also create a table that lets us convert and group Zip Codes into ZCTAs so we can compare data in either format.
Census Bureau ZCTAs
TL;DR
ZCTAs combine Zip Codes that have very few addresses and assign the most frequent Zip Code within each grouping as the ZCTA.
Obtaining The Data#
1.) Navigate to https://data.census.gov/cedsci/table. Here you can search all of the different information the US Census Bureau publicly provides. For this example, search for “B25109” or “Median Value by Year Householder Moved Into Unit” to use the same dataset, or pick another table of interest that allows you to break the data out by ZCTA5 code in the next step.
2.) Click the “Geos” button in the table header and navigate to “Zip Code Tabulation Area” (we will refer to this as ZCTA from here on). If you want to work with a smaller subset of data first, select a single state and then All ZCTAs for that state; or select the top button for all ZCTAs in the US. This will likely take some time since the table is roughly 100,000 rows. Hit the “Transpose” button to format the data so that ZCTA, Estimate, and Margin of Error are the rows and the year ranges are the columns.
Note
As stated in the first step, not all tables allow you to break the data out by ZCTA. While some of these steps can be applied to other geographies, this guide is explicitly for cleaning ZCTA data.
3.) The first column now gives us a median of the medians and should be removed, since we can easily run that calculation later if we want it. I also removed the final two columns, which hold data from before 2000, but these can be kept if you are interested in historical data going that far back. To remove these columns, click "Columns" on the far right side of the website and de-select the top two untitled buttons and the final two buttons for pre-2000 data.
4.) Press the CSV button on the top header to export your dataset to a CSV file and wait for it to download. This will likely take some time since it is almost 100,000 rows.
Importing The Data#
We’re finally ready to start preparing our data! It is a good idea to move the downloaded file into the same folder where the Jupyter Notebook or Python program will be saved. I also renamed the file so it is easy to identify and easy to read into Jupyter.
Execute the following cell, looking at both the head and the tail.
import pandas as pd
import numpy as np
df = pd.read_csv("contrib_data/US_Housing_by_zip.csv")
df.head(10)
#df.tail(10)
Label (Grouping) | Median value --!!Total:!!Moved in 2019 or later | Median value --!!Total:!!Moved in 2015 to 2018 | Median value --!!Total:!!Moved in 2010 to 2014 | Median value --!!Total:!!Moved in 2000 to 2009 | |
---|---|---|---|---|---|
0 | ZCTA5 00601 | NaN | NaN | NaN | NaN |
1 | Estimate | - | 82,300 | 81,700 | 95,100 |
2 | Margin of Error | ** | ±10,863 | ±29,876 | ±20,917 |
3 | ZCTA5 00602 | NaN | NaN | NaN | NaN |
4 | Estimate | - | 144,600 | 116,100 | 112,200 |
5 | Margin of Error | ** | ±45,088 | ±15,512 | ±20,066 |
6 | ZCTA5 00603 | NaN | NaN | NaN | NaN |
7 | Estimate | 260,300 | 134,400 | 130,400 | 124,100 |
8 | Margin of Error | ±31,083 | ±57,761 | ±13,058 | ±12,732 |
9 | ZCTA5 00606 | NaN | NaN | NaN | NaN |
Cleaning the Data#
To put it simply, this data is ugly. In its current format it is essentially unqueryable. It does follow a recognizable pattern, though:
Every 3 rows are ZCTA #, Estimate, and Margin of Error, with the Estimate and Margin of Error relating to the ZCTA above them.
The ZCTA row always starts with ‘ZCTA5’ and has no values.
Margin of Error always starts with ± when it has a value.
One other thing to note: the final 3 rows are an aggregate of all United States data. Since there is no single ZCTA associated with these rows and we can easily run this aggregate with a query when we are finished, we will drop these last 3 rows.
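Before relying on that 3-row pattern, it can be sanity-checked programmatically. The following is a minimal sketch on toy rows shaped like the output above, not the full census frame:

```python
import pandas as pd

# Toy frame mimicking the repeating 3-row pattern described above
# (ZCTA5 header row, then Estimate, then Margin of Error).
df = pd.DataFrame({'ZCTA5': ['ZCTA5 00601', 'Estimate', 'Margin of Error',
                             'ZCTA5 00602', 'Estimate', 'Margin of Error']})

# Every 3rd row starting at index 0 should be a ZCTA header row,
# and the rows at offsets 1 and 2 should be Estimate / Margin of Error.
assert df['ZCTA5'].iloc[::3].str.startswith('ZCTA5').all()
assert (df['ZCTA5'].iloc[1::3] == 'Estimate').all()
assert (df['ZCTA5'].iloc[2::3] == 'Margin of Error').all()
print('Pattern holds')
```

Running the same three assertions against the real frame (before dropping the last 3 aggregate rows) is a cheap way to catch an export that came down in a different shape.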
To get this data into a usable format we will:
Rename the columns for easier management and querying
This isn’t strictly necessary, but, much like SQL aliases, it makes several later steps easier
Remove all ± and other special characters
This is an easy cleaning step that simplifies the numeric conversion later
Drop the last 3 rows for the reasons stated above
Match each ‘Estimate’ and ‘Margin of Error’ with the appropriate ZCTA
This will get the data for each ZCTA into a single row
Ensure each column is the proper data type
Zip Codes and ZCTAs can have leading zeros. We need to ensure they import with these zeros for joining
Break out the year buckets into individual years
This will make querying more intuitive and allow for easier appending when new data becomes available
Drop any rows in the final dataframe that have no values
These steps will produce a final dataframe that is easy to query, easy to manage, and matches the schema of the other tables in the database. We will take the data above and transform it into the format below.
ZCTA | Estimate | Margin_of_Error | Year |
---|---|---|---|
00601 | 82300 | 10863 | 2015 |
00601 | 81700 | 29876 | 2014 |
Let’s get started. Execute the following cell.
# Rename the columns
# Remove the leading ± from the Margin of Error values
# Drop the last 3 rows (the US-level aggregate)
df2 = df.copy()
df2.columns = ['ZCTA5', '2019_Or_Later', '2015_To_2018', '2010_To_2014', '2000_To_2009']
df2 = df2.replace(r'^±', '', regex=True)
df2.drop(df2.tail(3).index, inplace=True)
df2
ZCTA5 | 2019_Or_Later | 2015_To_2018 | 2010_To_2014 | 2000_To_2009 | |
---|---|---|---|---|---|
0 | ZCTA5 00601 | NaN | NaN | NaN | NaN |
1 | Estimate | - | 82,300 | 81,700 | 95,100 |
2 | Margin of Error | ** | 10,863 | 29,876 | 20,917 |
3 | ZCTA5 00602 | NaN | NaN | NaN | NaN |
4 | Estimate | - | 144,600 | 116,100 | 112,200 |
... | ... | ... | ... | ... | ... |
99355 | Estimate | - | - | - | - |
99356 | Margin of Error | ** | ** | ** | ** |
99357 | ZCTA5 99929 | NaN | NaN | NaN | NaN |
99358 | Estimate | - | 196,900 | 178,800 | 231,900 |
99359 | Margin of Error | ** | 52,061 | 30,051 | 93,901 |
99360 rows × 5 columns
Break the data apart#
In the next steps we will create new dataframes for the ZCTA, Estimate, and Margin of Error rows. When we break these out into separate tables they retain their order. Even though each row will not directly have a ZCTA associated with it, we know that the first row belongs to ZCTA 00601, the second row to ZCTA 00602, and so on, based on the index, because the rows retained their order. This will allow us to merge the 3 dataframes back together on their index, giving a single row per ZCTA with the Estimate (‘EST’) and Margin of Error (‘MOE’) for each year bucket.
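As a minimal toy illustration of this index alignment before applying it to the real frame (values taken from the sample output above):

```python
import pandas as pd

# Toy stand-ins for the three re-indexed frames built below,
# assuming each was reset with reset_index(drop=True).
zcta = pd.Series(['00601', '00602'], name='ZCTA5')
est = pd.DataFrame({'2015_To_2018_EST': [82300, 144600]})
moe = pd.DataFrame({'2015_To_2018_MOE': [10863, 45088]})

# concat(axis=1) aligns on the shared 0..n index, giving one row per ZCTA.
merged = pd.concat([zcta, est, moe], axis=1)
print(merged)
```

This only works because all three frames were re-indexed to the same 0-based index; if any frame kept its original row labels, `concat` would misalign the rows.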
# Filter for only rows that start with 'ZCTA' and create a new dataframe. Retain the order and reindex the rows.
# Drop all columns except the first column because they contain no data
# Remove the 'ZCTA5 ' from each row so we are left with just the ZCTA numbers
ZCTA = df2[df2['ZCTA5'].str[:4]=='ZCTA'].copy().reset_index(drop=True)
ZCTA.drop(ZCTA.columns[[1,2,3,4]], axis=1, inplace=True)
ZCTA = ZCTA['ZCTA5'].str.split(' ').str[1]
ZCTA
0 00601
1 00602
2 00603
3 00606
4 00610
...
33115 99923
33116 99925
33117 99926
33118 99927
33119 99929
Name: ZCTA5, Length: 33120, dtype: object
# Filter the original dataframe for only rows that start with 'Estimate' and create a new dataframe. Retain the order and reindex the rows
# Rename the columns identifying them as EST columns for each year bucket
# Drop the first column because it would just say Estimate 33,000 times
EST = df2[df2['ZCTA5'].str.contains('Estimate', regex=False)].copy().reset_index(drop=True)
EST.columns = ['ZCTA5', '2019_Or_Later_EST', '2015_To_2018_EST', '2010_To_2014_EST', '2000_To_2009_EST']
EST.drop(EST.columns[[0]], axis=1, inplace=True)
EST
2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | |
---|---|---|---|---|
0 | - | 82,300 | 81,700 | 95,100 |
1 | - | 144,600 | 116,100 | 112,200 |
2 | 260,300 | 134,400 | 130,400 | 124,100 |
3 | - | 58,200 | 97,000 | 92,900 |
4 | - | 115,900 | 110,500 | 112,000 |
... | ... | ... | ... | ... |
33115 | - | - | - | - |
33116 | - | 263,900 | 339,200 | 181,300 |
33117 | - | 170,300 | 243,800 | 154,400 |
33118 | - | - | - | - |
33119 | - | 196,900 | 178,800 | 231,900 |
33120 rows × 4 columns
# Same process as above but for MOE
MOE = df2[df2['ZCTA5'].str.contains('Margin', regex=False)].copy().reset_index(drop=True)
MOE.columns = ['ZCTA5', '2019_Or_Later_MOE', '2015_To_2018_MOE', '2010_To_2014_MOE', '2000_To_2009_MOE']
MOE.drop(MOE.columns[[0]], axis=1, inplace=True)
MOE
2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|
0 | ** | 10,863 | 29,876 | 20,917 |
1 | ** | 45,088 | 15,512 | 20,066 |
2 | 31,083 | 57,761 | 13,058 | 12,732 |
3 | ** | 25,881 | 37,195 | 14,438 |
4 | ** | 26,647 | 15,103 | 14,945 |
... | ... | ... | ... | ... |
33115 | ** | ** | ** | ** |
33116 | ** | 31,175 | 33,706 | 19,820 |
33117 | ** | 20,982 | 162,405 | 31,403 |
33118 | ** | ** | ** | ** |
33119 | ** | 52,061 | 30,051 | 93,901 |
33120 rows × 4 columns
Rejoin the data#
Here we will first rejoin the EST and MOE tables and clean them before joining back to the ZCTAs. Since we want to keep ZCTA in ‘object’ (text) format, keeping it apart from the other data lets us easily convert everything else to a numeric format.
# Join the EST and MOE tables back together on the index
# Remove the thousands-separator commas so the values can be converted to numeric
# Replace the remaining placeholder characters ('-', '**', '***') with NaN
# Convert all data in the dataframe to numeric
Merge = pd.concat([EST, MOE], axis=1)
Merge = Merge.replace(',','', regex=True)
Merge = Merge.replace({'-': np.nan, '**': np.nan, '***': np.nan})
Merge = Merge.apply(pd.to_numeric, errors='coerce')
Merge
#Merge.dtypes
2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | 2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|---|---|---|---|
0 | NaN | 82300.0 | 81700.0 | 95100.0 | NaN | 10863.0 | 29876.0 | 20917.0 |
1 | NaN | 144600.0 | 116100.0 | 112200.0 | NaN | 45088.0 | 15512.0 | 20066.0 |
2 | 260300.0 | 134400.0 | 130400.0 | 124100.0 | 31083.0 | 57761.0 | 13058.0 | 12732.0 |
3 | NaN | 58200.0 | 97000.0 | 92900.0 | NaN | 25881.0 | 37195.0 | 14438.0 |
4 | NaN | 115900.0 | 110500.0 | 112000.0 | NaN | 26647.0 | 15103.0 | 14945.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
33115 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33116 | NaN | 263900.0 | 339200.0 | 181300.0 | NaN | 31175.0 | 33706.0 | 19820.0 |
33117 | NaN | 170300.0 | 243800.0 | 154400.0 | NaN | 20982.0 | 162405.0 | 31403.0 |
33118 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33119 | NaN | 196900.0 | 178800.0 | 231900.0 | NaN | 52061.0 | 30051.0 | 93901.0 |
33120 rows × 8 columns
Note
Note that the dtypes are float while we want int in the database. This is because pandas stores numeric columns containing NaN as float. It is fine here, since the target columns in the database are int and that is how the data will import.
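A minimal sketch of this float/NaN behavior, and of pandas’ nullable `Int64` dtype as an alternative if you prefer whole numbers inside pandas as well (optional here, since the database handles the conversion):

```python
import pandas as pd
import numpy as np

# A NaN in an otherwise-integer column forces the column to float64.
s = pd.Series([82300, np.nan])
print(s.dtype)       # float64

# pandas' nullable integer dtype keeps whole numbers and represents
# the missing value as <NA> instead.
s_int = s.astype('Int64')
print(s_int.dtype)   # Int64
```

`astype('Int64')` works here because every non-missing value is integral; it raises if a value has a fractional part.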
# Join the data with its respective ZCTA
Merge2 = pd.concat([ZCTA, Merge], axis=1)
Merge2
#Merge2.dtypes
ZCTA5 | 2019_Or_Later_EST | 2015_To_2018_EST | 2010_To_2014_EST | 2000_To_2009_EST | 2019_Or_Later_MOE | 2015_To_2018_MOE | 2010_To_2014_MOE | 2000_To_2009_MOE | |
---|---|---|---|---|---|---|---|---|---|
0 | 00601 | NaN | 82300.0 | 81700.0 | 95100.0 | NaN | 10863.0 | 29876.0 | 20917.0 |
1 | 00602 | NaN | 144600.0 | 116100.0 | 112200.0 | NaN | 45088.0 | 15512.0 | 20066.0 |
2 | 00603 | 260300.0 | 134400.0 | 130400.0 | 124100.0 | 31083.0 | 57761.0 | 13058.0 | 12732.0 |
3 | 00606 | NaN | 58200.0 | 97000.0 | 92900.0 | NaN | 25881.0 | 37195.0 | 14438.0 |
4 | 00610 | NaN | 115900.0 | 110500.0 | 112000.0 | NaN | 26647.0 | 15103.0 | 14945.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33116 | 99925 | NaN | 263900.0 | 339200.0 | 181300.0 | NaN | 31175.0 | 33706.0 | 19820.0 |
33117 | 99926 | NaN | 170300.0 | 243800.0 | 154400.0 | NaN | 20982.0 | 162405.0 | 31403.0 |
33118 | 99927 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
33119 | 99929 | NaN | 196900.0 | 178800.0 | 231900.0 | NaN | 52061.0 | 30051.0 | 93901.0 |
33120 rows × 9 columns
Note
Note that ZCTA5 is dtype object and retained the leading zeros like we wanted
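The leading zeros survived because the ZCTA column was built by splitting strings. If the codes had ever been read as integers (a common mistake with `read_csv`), `zfill` could restore them; a small sketch:

```python
import pandas as pd

# ZCTA codes that were read as integers, dropping their leading zeros.
z = pd.Series([601, 602, 99929])

# Convert back to 5-character strings so they join correctly.
z5 = z.astype(str).str.zfill(5)
print(z5.tolist())  # ['00601', '00602', '99929']
```

The safer option is to avoid the problem entirely by passing `dtype=str` for the code columns at import time, as done for the crosswalk file later in this guide.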
Break The Buckets Into Start and End Years#
Since querying for an individual year is more intuitive than querying a bucketed year-range column, we will now break each bucket out into its respective years. First, we will create tables with new columns for the beginning and ending year of each bucket. Then we will do the equivalent of a SQL UNION to bring them all back together.
# Create a new table for only ZCTA5 and the 2019_Or_Later buckets
# Create a new column for the first year in the bucket
# Create a new column for the last year in the bucket (in this case we do not have data after 2019 so end in 2020)
# Rename the columns
Trans1 = Merge2.filter(['ZCTA5', '2019_Or_Later_EST', '2019_Or_Later_MOE' ], axis=1)
Trans1['Start_Year'] = 2019
Trans1['End_Year'] = 2020
Trans1.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans1
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | NaN | NaN | 2019 | 2020 |
1 | 00602 | NaN | NaN | 2019 | 2020 |
2 | 00603 | 260300.0 | 31083.0 | 2019 | 2020 |
3 | 00606 | NaN | NaN | 2019 | 2020 |
4 | 00610 | NaN | NaN | 2019 | 2020 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2019 | 2020 |
33116 | 99925 | NaN | NaN | 2019 | 2020 |
33117 | 99926 | NaN | NaN | 2019 | 2020 |
33118 | 99927 | NaN | NaN | 2019 | 2020 |
33119 | 99929 | NaN | NaN | 2019 | 2020 |
33120 rows × 5 columns
# Same as above except 2015 start and 2018 end
# Rename the columns same as above to allow matching for union later
Trans2 = Merge2.filter(['ZCTA5', '2015_To_2018_EST', '2015_To_2018_MOE'])
Trans2['Start_Year'] = 2015
Trans2['End_Year'] = 2018
Trans2.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 |
1 | 00602 | 144600.0 | 45088.0 | 2015 | 2018 |
2 | 00603 | 134400.0 | 57761.0 | 2015 | 2018 |
3 | 00606 | 58200.0 | 25881.0 | 2015 | 2018 |
4 | 00610 | 115900.0 | 26647.0 | 2015 | 2018 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2015 | 2018 |
33116 | 99925 | 263900.0 | 31175.0 | 2015 | 2018 |
33117 | 99926 | 170300.0 | 20982.0 | 2015 | 2018 |
33118 | 99927 | NaN | NaN | 2015 | 2018 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 |
33120 rows × 5 columns
# Same as above except for 2010 start and 2014 end
Trans3 = Merge2.filter(['ZCTA5', '2010_To_2014_EST', '2010_To_2014_MOE'])
Trans3['Start_Year'] = 2010
Trans3['End_Year'] = 2014
Trans3.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans3
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 |
1 | 00602 | 116100.0 | 15512.0 | 2010 | 2014 |
2 | 00603 | 130400.0 | 13058.0 | 2010 | 2014 |
3 | 00606 | 97000.0 | 37195.0 | 2010 | 2014 |
4 | 00610 | 110500.0 | 15103.0 | 2010 | 2014 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2010 | 2014 |
33116 | 99925 | 339200.0 | 33706.0 | 2010 | 2014 |
33117 | 99926 | 243800.0 | 162405.0 | 2010 | 2014 |
33118 | 99927 | NaN | NaN | 2010 | 2014 |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 |
33120 rows × 5 columns
# Same as above except for 2000 start and 2009 end
Trans4 = Merge2.filter(['ZCTA5', '2000_To_2009_EST', '2000_To_2009_MOE'])
Trans4['Start_Year'] = 2000
Trans4['End_Year'] = 2009
Trans4.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']
Trans4
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 |
2 | 00603 | 124100.0 | 12732.0 | 2000 | 2009 |
3 | 00606 | 92900.0 | 14438.0 | 2000 | 2009 |
4 | 00610 | 112000.0 | 14945.0 | 2000 | 2009 |
... | ... | ... | ... | ... | ... |
33115 | 99923 | NaN | NaN | 2000 | 2009 |
33116 | 99925 | 181300.0 | 19820.0 | 2000 | 2009 |
33117 | 99926 | 154400.0 | 31403.0 | 2000 | 2009 |
33118 | 99927 | NaN | NaN | 2000 | 2009 |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 |
33120 rows × 5 columns
# Define the 4 new dataframes in a list called 'frames'
# Bring all the data back together by concatenating the frames. Sort the data by ZCTA and year
frames = [Trans1, Trans2, Trans3, Trans4]
frames2 = pd.concat(frames).sort_values(['ZCTA5', 'Start_Year'])
frames2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | |
---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 |
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 |
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 |
0 | 00601 | NaN | NaN | 2019 | 2020 |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 |
... | ... | ... | ... | ... | ... |
33118 | 99927 | NaN | NaN | 2019 | 2020 |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 |
33119 | 99929 | NaN | NaN | 2019 | 2020 |
132480 rows × 5 columns
Break The Buckets To Individual Years#
Now we could run a SQL query with Year BETWEEN Start_Year AND End_Year, but we want to be able to query a single row for a single year, to be consistent with some of the other tables in the database. To do this, we will use a lambda function to create a column holding a list of all the years in each bucket, then use explode to create a row for each year in that new list.
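A minimal toy illustration of this build-a-list-then-explode pattern before it is applied to the full frame below:

```python
import pandas as pd

# One bucket row fans out to one row per year in the bucket.
df = pd.DataFrame({'ZCTA5': ['00601'], 'Start_Year': [2015], 'End_Year': [2018]})
df['Year'] = df.apply(lambda r: list(range(r['Start_Year'], r['End_Year'] + 1)), axis=1)

# explode() repeats the other column values for each list element.
out = df.explode('Year')
print(out['Year'].tolist())  # [2015, 2016, 2017, 2018]
```

Note that `explode` repeats the original index for every emitted row, which is why the exploded output later in this guide shows the same index value several times.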
# Create list of all years in the bucket
frames2['Year'] = frames2.apply(lambda x: list(range(x['Start_Year'], x['End_Year']+1)), axis =1)
frames2
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Start_Year | End_Year | Year | |
---|---|---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
0 | 00601 | 81700.0 | 29876.0 | 2010 | 2014 | [2010, 2011, 2012, 2013, 2014] |
0 | 00601 | 82300.0 | 10863.0 | 2015 | 2018 | [2015, 2016, 2017, 2018] |
0 | 00601 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
1 | 00602 | 112200.0 | 20066.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
... | ... | ... | ... | ... | ... | ... |
33118 | 99927 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
33119 | 99929 | 231900.0 | 93901.0 | 2000 | 2009 | [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200... |
33119 | 99929 | 178800.0 | 30051.0 | 2010 | 2014 | [2010, 2011, 2012, 2013, 2014] |
33119 | 99929 | 196900.0 | 52061.0 | 2015 | 2018 | [2015, 2016, 2017, 2018] |
33119 | 99929 | NaN | NaN | 2019 | 2020 | [2019, 2020] |
132480 rows × 6 columns
# Explode the Year list into one row per year and drop the now unnecessary start and end year columns
# Drop any rows where there is no estimate. If there is no data, there is no reason for it to be in the database.
final = frames2.explode('Year').drop(columns=['Start_Year','End_Year'])
final = final.dropna(subset = ['ESTIMATE'])
final
ZCTA5 | ESTIMATE | MARGIN_OF_ERROR | Year | |
---|---|---|---|---|
0 | 00601 | 95100.0 | 20917.0 | 2000 |
0 | 00601 | 95100.0 | 20917.0 | 2001 |
0 | 00601 | 95100.0 | 20917.0 | 2002 |
0 | 00601 | 95100.0 | 20917.0 | 2003 |
0 | 00601 | 95100.0 | 20917.0 | 2004 |
... | ... | ... | ... | ... |
33119 | 99929 | 178800.0 | 30051.0 | 2014 |
33119 | 99929 | 196900.0 | 52061.0 | 2015 |
33119 | 99929 | 196900.0 | 52061.0 | 2016 |
33119 | 99929 | 196900.0 | 52061.0 | 2017 |
33119 | 99929 | 196900.0 | 52061.0 | 2018 |
501841 rows × 4 columns
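The final frame is now ready to be written out as the CSV (or Excel) file promised at the start of this guide. A hedged sketch, using a one-row stand-in for `final` so the snippet runs on its own (in the notebook you would call `to_csv` directly on `final`; the filename is illustrative):

```python
import pandas as pd

# One-row stand-in for the cleaned `final` frame above.
final = pd.DataFrame({'ZCTA5': ['00601'], 'ESTIMATE': [95100],
                      'MARGIN_OF_ERROR': [20917], 'Year': [2000]})

# index=False keeps the pandas row index out of the file, so the CSV
# columns line up with the target table's schema.
final.to_csv("median_house_value_by_zcta.csv", index=False)

# When reading the file back, dtype=str preserves the leading zeros in ZCTA5.
check = pd.read_csv("median_house_value_by_zcta.csv", dtype={'ZCTA5': str})
print(check['ZCTA5'].iloc[0])  # 00601
```

`final.to_excel(...)` works the same way if your import tool prefers Excel (it requires the openpyxl package).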
Next Steps - Zip to ZCTA#
Now that we have our data in a format that matches some of the other tables in the database, we need a way to match this ZCTA-level table to Zip Code level tables. To do this, we will use the Zip-to-ZCTA Crosswalk provided at the link below. It allows us to write a GROUP BY in a SQL query that converts Zip-level data to ZCTA-level data, which can then be joined to our ZCTA-level table. This table also has the added benefit of including the City and State name for each Zip.
https://udsmapper.org/zip-code-to-zcta-crosswalk/
import pandas as pd
import numpy as np
# Specify each column as str on import or it will drop the leading zeros that we need.
df = pd.read_excel("contrib_data/ZIPCodetoZCTACrosswalk2021UDS.xlsx", dtype = {'ZIP_CODE': str, 'PO_NAME': str, 'STATE': str, 'ZIP_TYPE': str, 'ZCTA': str, 'zip_join_type': str})
df
ZIP_CODE | PO_NAME | STATE | ZIP_TYPE | ZCTA | zip_join_type | |
---|---|---|---|---|---|---|
0 | 00501 | Holtsville | NY | Post Office or large volume customer | 11742 | Spatial join to ZCTA |
1 | 00544 | Holtsville | NY | Post Office or large volume customer | 11742 | Spatial join to ZCTA |
2 | 00601 | Adjuntas | PR | Zip Code Area | 00601 | Zip matches ZCTA |
3 | 00602 | Aguada | PR | Zip Code Area | 00602 | Zip matches ZCTA |
4 | 00603 | Aguadilla | PR | Zip Code Area | 00603 | Zip matches ZCTA |
... | ... | ... | ... | ... | ... | ... |
41086 | 99926 | Metlakatla | AK | Zip Code Area | 99926 | Zip matches ZCTA |
41087 | 99927 | Point Baker | AK | Zip Code Area | 99927 | Zip matches ZCTA |
41088 | 99928 | Ward Cove | AK | Post Office or large volume customer | 99901 | Spatial join to ZCTA |
41089 | 99929 | Wrangell | AK | Zip Code Area | 99929 | Zip matches ZCTA |
41090 | 99950 | Ketchikan | AK | Zip Code Area | 99921 | Spatial join to ZCTA |
41091 rows × 6 columns
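The Zip-to-ZCTA grouping described above can also be sketched in pandas, which is handy for checking the logic before writing the SQL. The income figures below are hypothetical toy values, not real IRS data:

```python
import pandas as pd

# Toy Zip-level income data (hypothetical values) and a slice of the
# crosswalk columns shown above.
income = pd.DataFrame({'ZIP_CODE': ['00501', '00544', '00601'],
                       'INCOME': [60000, 62000, 25000]})
crosswalk = pd.DataFrame({'ZIP_CODE': ['00501', '00544', '00601'],
                          'ZCTA': ['11742', '11742', '00601']})

# Join each Zip to its ZCTA, then aggregate: the pandas analogue of the
# SQL subquery used at the end of this guide.
zcta_income = (income.merge(crosswalk, on='ZIP_CODE')
                     .groupby('ZCTA', as_index=False)['INCOME'].mean())
print(zcta_income)
```

Note how the two Holtsville Zips (00501 and 00544) collapse into the single ZCTA 11742, exactly the behavior the crosswalk table above documents.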
Ready to Import to Database#
Both of the tables are now ready to import into the database. Refer to the P4Analytics page on different methods to connect and import data.
Get Querying#
The following query is an example of how we might use some of this data. Let’s say we want to look at the average income and the median home value in all Little Rock, AR Zips for 2018. Our income data is at the Zip Code level and the median housing data is at the ZCTA level. Because of these two different levels, we first need to group the Zip Code level income data into ZCTAs. We can then use this income-by-ZCTA table as a subquery and join it to the ZCTA median housing data. Because we know the ZCTA is associated with its most frequent Zip, we can also join back to the Zip-to-ZCTA conversion table to get the City and State for that Zip/ZCTA and filter the results.
import pandas as pd
import pypyodbc as pyodbc
#Change the values below to your personal and server credentials
svr = 'the_server'
uid = 'your_user_id'
pwd = 'your_password'
db = 'the_database'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER='+svr+';DATABASE='+db+';UID='+uid+';PWD='+ pwd+';TrustServerCertificate=yes;')
query = "select d.PO_NAME as City \
, d.STATE \
, INCOME.* \
, c.ESTIMATE as MEDIAN_HOUSING_ESTIMATE \
, c.ESTIMATE - c.MARGIN_OF_ERROR as MEDIAN_HOUSE_LB \
, c.ESTIMATE + c.MARGIN_OF_ERROR as MEDIAN_HOUSE_UB \
from(select b.ZCTA \
, sum(a.Number_Of_Returns) as NBR_RTRNS \
, floor(avg(a.Total_Income_Amount)) as AVG_INCOME \
, a.Date_Year \
from IRS_ZIPCODE_SOURCE a \
join ZIP_TO_ZCTA b \
on a.Zipcode = b.ZIP_CODE \
group by b.ZCTA, a.Date_Year) as INCOME \
join MEDIAN_HOUSE_VALUE c \
on INCOME.ZCTA = c.ZCTA5 \
and INCOME.Date_Year = c.Year \
join ZIP_TO_ZCTA d \
on c.ZCTA5 = d.ZIP_CODE \
where d.STATE = 'AR' \
and d.PO_NAME = 'Little Rock' \
and INCOME.Date_Year = 2018"
df = pd.read_sql(query, cnxn)
df