US Census Bureau Median Housing Data#

The US Census Bureau publicly provides an extensive amount of the information it collects. Sifting through it to find the data you're looking for is difficult enough; getting that data into a usable format that can be uploaded to an RDBMS is something else entirely. This guide walks through how to retrieve the data, how to clean and reshape it in Python, and how to export it to an Excel or CSV file that can be uploaded to an RDBMS. This is the same process used to create the MEDIAN_HOUSE_VALUE table in the USZIPCODE database for the University.

ZCTA Vs Zip Code#

For this example we will be using the Census Bureau-derived ZCTAs rather than actual Zip Codes. You can read more about ZCTAs at the link below. We will also create a table that lets us convert and group Zip Codes into ZCTAs so we can compare data in either format.
Census Bureau ZCTAs

TL;DR

ZCTAs combine Zip Codes that have very few addresses and assign the most frequent Zip within that grouping as the ZCTA. For example, in the crosswalk table later in this guide, the small Post Office Zip 00501 is folded into ZCTA 11742.

Obtaining The Data#

1.) Navigate to https://data.census.gov/cedsci/table. Here you can search all of the different information that the US Census Bureau publicly provides. For this example, search for “B25109” or “Median Value by Year Householder Moved Into Unit” to use the same dataset, or pick another table of interest that allows you to break the data out by ZCTA5 code in the next step.

2.) Click the “Geos” button in the table header and navigate to “Zip Code Tabulation Area” (we will refer to this as ZCTA from here on). If you want to work with a smaller subset of data first, select a single state and then All ZCTA for that state, or select the top button for all ZCTA in the US. The latter will likely take some time to load since the table is roughly 100,000 rows. Hit the “Transpose” button to format the data so that ZCTA, Estimate, and Margin of Error are the rows and the year ranges are the columns.

Note

As stated in the first step, not all tables allow you to break the data out by ZCTA. While some of these steps can be reused for other geographies, this guide is explicitly for cleaning ZCTA data.



3.) The first column now gives us a median of the medians and should be removed, since we can easily run that calculation later if we want it. I also removed the final two columns, which hold data from before 2000, but these can be kept if you are interested in historical data going that far back. To remove these columns, click "Columns" on the far right side of the website and de-select the top two buttons with no title next to them, and the final two buttons for data before 2000.

4.) Press the CSV button on the top header to export your dataset to a CSV file and wait for it to download. This will likely take some time since it is almost 100,000 rows.

Importing The Data#

We’re finally ready to start preparing our data! It is a good idea to move the downloaded file into the same folder where the Jupyter Notebook or Python program will be saved. I also renamed the file so it is easy to identify and easy to read into Jupyter.

Execute the following cell and look at both the head and the tail of the dataframe.

import pandas as pd
import numpy as np

df = pd.read_csv("contrib_data/US_Housing_by_zip.csv")

df.head(10)
#df.tail(10)
Label (Grouping) Median value --!!Total:!!Moved in 2019 or later Median value --!!Total:!!Moved in 2015 to 2018 Median value --!!Total:!!Moved in 2010 to 2014 Median value --!!Total:!!Moved in 2000 to 2009
0 ZCTA5 00601 NaN NaN NaN NaN
1 Estimate - 82,300 81,700 95,100
2 Margin of Error ** ±10,863 ±29,876 ±20,917
3 ZCTA5 00602 NaN NaN NaN NaN
4 Estimate - 144,600 116,100 112,200
5 Margin of Error ** ±45,088 ±15,512 ±20,066
6 ZCTA5 00603 NaN NaN NaN NaN
7 Estimate 260,300 134,400 130,400 124,100
8 Margin of Error ±31,083 ±57,761 ±13,058 ±12,732
9 ZCTA5 00606 NaN NaN NaN NaN

Cleaning the Data#

To put it simply, this data is ugly. The current format is essentially unqueryable. The data does follow a recognizable pattern though:

  • Every 3 rows follow the pattern ZCTA #, Estimate, Margin of Error, with the Estimate and Margin of Error below relating to the ZCTA above

  • The ZCTA row always starts with ‘ZCTA5’ and has no values.

  • Margin of Error always starts with ± when it has a value.

    One other thing to note: the final 3 rows are an aggregate of all United States data. Since there is no single ZCTA associated with these rows, and we can easily run this aggregate with a query when we are finished, we will want to drop them. A quick check, shown below, confirms the 3-row pattern.
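
    Before cleaning, we can verify this pattern programmatically. A quick optional check (a sketch, using the df loaded above; the raw first column is named 'Label (Grouping)'):

first_col = df.columns[0]  # 'Label (Grouping)' in the raw export
print(df[first_col].str.startswith('ZCTA5', na=False).sum())     # ZCTA label rows
print(df[first_col].str.startswith('Estimate', na=False).sum())  # Estimate rows (includes the US aggregate)
print(df[first_col].str.startswith('Margin', na=False).sum())    # Margin of Error rows (likewise)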

    To get this data into a usable format we will:

  1. Rename the columns for easier management and querying

    • This isn’t necessary, but like working with SQL aliases this makes some things much easier

  2. Remove all ± and other special characters

    • This will help with point 5 and is an easy cleaning step before processing

  3. Drop the last 3 rows for reasons stated above

  4. Match each ‘Estimate’ and ‘Margin of Error’ with the appropriate ZCTA

    • This will get the data for each ZCTA into a single row

  5. Ensure each column is the proper data type

    • Zip and ZCTA can have leading zeros. We need to ensure they import with these zeros for joining

  6. Break out the year buckets into individual years

    • This will make querying more intuitive and allow for easier appending when data for new years becomes available

  7. Drop any rows in the final dataframe that have no values

    These steps will produce a final dataframe that is easy to query, easy to manage, and matches the schema of the other tables in the database. We will take the data above and transform it into the format below.

ZCTA   Estimate  Margin_of_Error  Year
00601  82300     10863            2015
00601  81700     29876            2014

Let's get started. Execute the following cell.

# Copy the dataframe so we don't mutate df, then rename the columns
# Remove the leading ± characters
# Drop the last 3 aggregate rows
df2 = df.copy()
df2.columns = ['ZCTA5', '2019_Or_Later', '2015_To_2018', '2010_To_2014', '2000_To_2009']
df2 = df2.replace(r'^±', '', regex=True)
df2.drop(df2.tail(3).index, inplace=True)

df2
ZCTA5 2019_Or_Later 2015_To_2018 2010_To_2014 2000_To_2009
0 ZCTA5 00601 NaN NaN NaN NaN
1 Estimate - 82,300 81,700 95,100
2 Margin of Error ** 10,863 29,876 20,917
3 ZCTA5 00602 NaN NaN NaN NaN
4 Estimate - 144,600 116,100 112,200
... ... ... ... ... ...
99355 Estimate - - - -
99356 Margin of Error ** ** ** **
99357 ZCTA5 99929 NaN NaN NaN NaN
99358 Estimate - 196,900 178,800 231,900
99359 Margin of Error ** 52,061 30,051 93,901

99360 rows × 5 columns

Break the data apart#

In the next steps we will create new dataframes for the ZCTA, Estimate, and Margin of Error rows. When we break these out into separate tables they retain their order. Even though each row will not directly have a ZCTA associated with it, we know that the first row corresponds to ZCTA 00601, the second row to ZCTA 00602, and so on, based on the index, because the rows retained their order. This will allow us to merge these 3 dataframes back together on their index, giving us a single row for each ZCTA with the Estimate (‘EST’) and Margin of Error (‘MOE’) for each year bucket.

# Filter for only rows that start with 'ZCTA' and create a new dataframe. Retain the order and reindex the rows.
# Drop all columns except the first column because they contain no data
# Remove the 'ZCTA5 ' prefix from each row so we are left with just the ZCTA numbers
ZCTA = df2[df2['ZCTA5'].str[:4] == 'ZCTA'].copy().reset_index(drop=True)
ZCTA.drop(ZCTA.columns[[1, 2, 3, 4]], axis=1, inplace=True)
ZCTA = ZCTA['ZCTA5'].str.split(' ').str[1]

ZCTA
0        00601
1        00602
2        00603
3        00606
4        00610
         ...  
33115    99923
33116    99925
33117    99926
33118    99927
33119    99929
Name: ZCTA5, Length: 33120, dtype: object
# Filter the original dataframe for only rows that start with 'Estimate' and create a new dataframe. Retain the order and reindex the rows
# Rename the columns identifying them as EST columns for each year bucket
# Drop the first column because it would just say Estimate 33,000 times
EST = df2[df2['ZCTA5'].str.contains('Estimate', regex=False)].copy().reset_index(drop=True)
EST.columns = ['ZCTA5', '2019_Or_Later_EST', '2015_To_2018_EST', '2010_To_2014_EST', '2000_To_2009_EST']
EST.drop(EST.columns[[0]], axis=1, inplace=True)

EST
2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST 2000_To_2009_EST
0 - 82,300 81,700 95,100
1 - 144,600 116,100 112,200
2 260,300 134,400 130,400 124,100
3 - 58,200 97,000 92,900
4 - 115,900 110,500 112,000
... ... ... ... ...
33115 - - - -
33116 - 263,900 339,200 181,300
33117 - 170,300 243,800 154,400
33118 - - - -
33119 - 196,900 178,800 231,900

33120 rows × 4 columns

# Same process as above but for MOE
MOE = df2[df2['ZCTA5'].str.contains('Margin', regex=False)].copy().reset_index(drop=True)
MOE.columns = ['ZCTA5', '2019_Or_Later_MOE', '2015_To_2018_MOE', '2010_To_2014_MOE', '2000_To_2009_MOE']
MOE.drop(MOE.columns[[0]], axis=1, inplace=True)

MOE
2019_Or_Later_MOE 2015_To_2018_MOE 2010_To_2014_MOE 2000_To_2009_MOE
0 ** 10,863 29,876 20,917
1 ** 45,088 15,512 20,066
2 31,083 57,761 13,058 12,732
3 ** 25,881 37,195 14,438
4 ** 26,647 15,103 14,945
... ... ... ... ...
33115 ** ** ** **
33116 ** 31,175 33,706 19,820
33117 ** 20,982 162,405 31,403
33118 ** ** ** **
33119 ** 52,061 30,051 93,901

33120 rows × 4 columns

Rejoin the data#

Here we will first rejoin the EST and MOE tables and clean them before joining back to the ZCTAs. Since we want to keep ZCTA in ‘object’ (text) format, keeping it out of this merge lets us easily convert everything else to a numeric format.
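
Before merging, a quick optional sanity check that the three frames line up row-for-row, each holding exactly one third of df2's rows:

# All three frames were built from the same ordered df2, so their lengths must agree
assert len(ZCTA) == len(EST) == len(MOE) == len(df2) // 3  # 33120 each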

# Join the EST and MOE tables back together on the index
# Remove the thousands-separator commas so the strings can be parsed as numbers
# Replace the other placeholder characters with NaN
# Convert all data in the dataframe to numeric
Merge = pd.concat([EST, MOE], axis=1)
Merge = Merge.replace(',', '', regex=True)
Merge = Merge.replace({'-': np.nan, '**': np.nan, '***': np.nan})
Merge = Merge.apply(pd.to_numeric, errors='coerce')

Merge
#Merge.dtypes
2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE 2010_To_2014_MOE 2000_To_2009_MOE
0 NaN 82300.0 81700.0 95100.0 NaN 10863.0 29876.0 20917.0
1 NaN 144600.0 116100.0 112200.0 NaN 45088.0 15512.0 20066.0
2 260300.0 134400.0 130400.0 124100.0 31083.0 57761.0 13058.0 12732.0
3 NaN 58200.0 97000.0 92900.0 NaN 25881.0 37195.0 14438.0
4 NaN 115900.0 110500.0 112000.0 NaN 26647.0 15103.0 14945.0
... ... ... ... ... ... ... ... ...
33115 NaN NaN NaN NaN NaN NaN NaN NaN
33116 NaN 263900.0 339200.0 181300.0 NaN 31175.0 33706.0 19820.0
33117 NaN 170300.0 243800.0 154400.0 NaN 20982.0 162405.0 31403.0
33118 NaN NaN NaN NaN NaN NaN NaN NaN
33119 NaN 196900.0 178800.0 231900.0 NaN 52061.0 30051.0 93901.0

33120 rows × 8 columns

Note

Note that the dtypes are float and we want int in the database. This is fine: the target column in the database will be int, and that is how the data will import.
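
If you would rather carry integers in pandas as well, one option (a sketch; not required for the import) is the nullable Int64 dtype, since plain int64 cannot hold the NaN rows. This guide keeps the floats:

# Optional alternative: pandas' nullable integer dtype can hold NaN (as <NA>)
# Merge = Merge.astype('Int64')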

# Join the data with its respective ZCTA
Merge2 = pd.concat([ZCTA, Merge], axis=1)

Merge2
#Merge2.dtypes
ZCTA5 2019_Or_Later_EST 2015_To_2018_EST 2010_To_2014_EST 2000_To_2009_EST 2019_Or_Later_MOE 2015_To_2018_MOE 2010_To_2014_MOE 2000_To_2009_MOE
0 00601 NaN 82300.0 81700.0 95100.0 NaN 10863.0 29876.0 20917.0
1 00602 NaN 144600.0 116100.0 112200.0 NaN 45088.0 15512.0 20066.0
2 00603 260300.0 134400.0 130400.0 124100.0 31083.0 57761.0 13058.0 12732.0
3 00606 NaN 58200.0 97000.0 92900.0 NaN 25881.0 37195.0 14438.0
4 00610 NaN 115900.0 110500.0 112000.0 NaN 26647.0 15103.0 14945.0
... ... ... ... ... ... ... ... ... ...
33115 99923 NaN NaN NaN NaN NaN NaN NaN NaN
33116 99925 NaN 263900.0 339200.0 181300.0 NaN 31175.0 33706.0 19820.0
33117 99926 NaN 170300.0 243800.0 154400.0 NaN 20982.0 162405.0 31403.0
33118 99927 NaN NaN NaN NaN NaN NaN NaN NaN
33119 99929 NaN 196900.0 178800.0 231900.0 NaN 52061.0 30051.0 93901.0

33120 rows × 9 columns

Note

Note that ZCTA5 is dtype object and retained the leading zeros, as we wanted.

Break The Buckets Into Start and End Years#

Since querying for an individual year is more intuitive than querying a bucketed year-range column, we will now break each bucket out into its respective years. First, we will create tables with new columns for the beginning and ending years of each bucket. Then we will do the equivalent of a SQL UNION to bring them all back together.

# Create a new table for only ZCTA5 and the 2019_Or_Later buckets
# Create a new column for the first year in the bucket
# Create a new column for the last year in the bucket (in this case we do not have data after 2019 so end in 2020)
# Rename the columns
Trans1 = Merge2.filter(['ZCTA5', '2019_Or_Later_EST', '2019_Or_Later_MOE' ], axis=1)
Trans1['Start_Year'] = 2019
Trans1['End_Year'] = 2020
Trans1.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']

Trans1
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year
0 00601 NaN NaN 2019 2020
1 00602 NaN NaN 2019 2020
2 00603 260300.0 31083.0 2019 2020
3 00606 NaN NaN 2019 2020
4 00610 NaN NaN 2019 2020
... ... ... ... ... ...
33115 99923 NaN NaN 2019 2020
33116 99925 NaN NaN 2019 2020
33117 99926 NaN NaN 2019 2020
33118 99927 NaN NaN 2019 2020
33119 99929 NaN NaN 2019 2020

33120 rows × 5 columns

# Same as above except 2015 start and 2018 end
# Rename the columns same as above to allow matching for union later
Trans2 = Merge2.filter(['ZCTA5', '2015_To_2018_EST', '2015_To_2018_MOE'])
Trans2['Start_Year'] = 2015
Trans2['End_Year'] = 2018
Trans2.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']

Trans2
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year
0 00601 82300.0 10863.0 2015 2018
1 00602 144600.0 45088.0 2015 2018
2 00603 134400.0 57761.0 2015 2018
3 00606 58200.0 25881.0 2015 2018
4 00610 115900.0 26647.0 2015 2018
... ... ... ... ... ...
33115 99923 NaN NaN 2015 2018
33116 99925 263900.0 31175.0 2015 2018
33117 99926 170300.0 20982.0 2015 2018
33118 99927 NaN NaN 2015 2018
33119 99929 196900.0 52061.0 2015 2018

33120 rows × 5 columns

# Same as above except for 2010 start and 2014 end
Trans3 = Merge2.filter(['ZCTA5', '2010_To_2014_EST', '2010_To_2014_MOE'])
Trans3['Start_Year'] = 2010
Trans3['End_Year'] = 2014
Trans3.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']

Trans3
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year
0 00601 81700.0 29876.0 2010 2014
1 00602 116100.0 15512.0 2010 2014
2 00603 130400.0 13058.0 2010 2014
3 00606 97000.0 37195.0 2010 2014
4 00610 110500.0 15103.0 2010 2014
... ... ... ... ... ...
33115 99923 NaN NaN 2010 2014
33116 99925 339200.0 33706.0 2010 2014
33117 99926 243800.0 162405.0 2010 2014
33118 99927 NaN NaN 2010 2014
33119 99929 178800.0 30051.0 2010 2014

33120 rows × 5 columns

# Same as above except for 2000 start and 2009 end
Trans4 = Merge2.filter(['ZCTA5', '2000_To_2009_EST', '2000_To_2009_MOE'])
Trans4['Start_Year'] = 2000
Trans4['End_Year'] = 2009
Trans4.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR', 'Start_Year', 'End_Year']

Trans4
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year
0 00601 95100.0 20917.0 2000 2009
1 00602 112200.0 20066.0 2000 2009
2 00603 124100.0 12732.0 2000 2009
3 00606 92900.0 14438.0 2000 2009
4 00610 112000.0 14945.0 2000 2009
... ... ... ... ... ...
33115 99923 NaN NaN 2000 2009
33116 99925 181300.0 19820.0 2000 2009
33117 99926 154400.0 31403.0 2000 2009
33118 99927 NaN NaN 2000 2009
33119 99929 231900.0 93901.0 2000 2009

33120 rows × 5 columns
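
As an aside, the four nearly identical cells above could be written as one loop. A sketch that builds the same list of frames the next cell defines (running both is redundant but harmless):

# Build one frame per bucket in a loop instead of four copy-pasted cells
buckets = {'2019_Or_Later': (2019, 2020),
           '2015_To_2018': (2015, 2018),
           '2010_To_2014': (2010, 2014),
           '2000_To_2009': (2000, 2009)}
frames = []
for name, (start, end) in buckets.items():
    t = Merge2.filter(['ZCTA5', name + '_EST', name + '_MOE'])
    t.columns = ['ZCTA5', 'ESTIMATE', 'MARGIN_OF_ERROR']
    t['Start_Year'] = start
    t['End_Year'] = end
    frames.append(t)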

# Collect the 4 new dataframes in the list 'frames'
# Bring all the data back together by concatenating the frames list. Sort the data by ZCTA and Year
frames = [Trans1, Trans2, Trans3, Trans4]
frames2 = pd.concat(frames).sort_values(['ZCTA5', 'Start_Year'])

frames2
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year
0 00601 95100.0 20917.0 2000 2009
0 00601 81700.0 29876.0 2010 2014
0 00601 82300.0 10863.0 2015 2018
0 00601 NaN NaN 2019 2020
1 00602 112200.0 20066.0 2000 2009
... ... ... ... ... ...
33118 99927 NaN NaN 2019 2020
33119 99929 231900.0 93901.0 2000 2009
33119 99929 178800.0 30051.0 2010 2014
33119 99929 196900.0 52061.0 2015 2018
33119 99929 NaN NaN 2019 2020

132480 rows × 5 columns

Break The Buckets To Individual Years#

Now we could run a SQL query with Year between Start_Year and End_Year, but we want to be able to query for a single row and a single year to be consistent with some of the other tables in the database. To do this, we will use a lambda function to create a column holding a list of all the years in each bucket. Then we use explode to create one row for each Year in that list.

# Create a list of all years in the bucket
frames2['Year'] = frames2.apply(lambda x: list(range(x['Start_Year'], x['End_Year'] + 1)), axis=1)

frames2
ZCTA5 ESTIMATE MARGIN_OF_ERROR Start_Year End_Year Year
0 00601 95100.0 20917.0 2000 2009 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
0 00601 81700.0 29876.0 2010 2014 [2010, 2011, 2012, 2013, 2014]
0 00601 82300.0 10863.0 2015 2018 [2015, 2016, 2017, 2018]
0 00601 NaN NaN 2019 2020 [2019, 2020]
1 00602 112200.0 20066.0 2000 2009 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
... ... ... ... ... ... ...
33118 99927 NaN NaN 2019 2020 [2019, 2020]
33119 99929 231900.0 93901.0 2000 2009 [2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...
33119 99929 178800.0 30051.0 2010 2014 [2010, 2011, 2012, 2013, 2014]
33119 99929 196900.0 52061.0 2015 2018 [2015, 2016, 2017, 2018]
33119 99929 NaN NaN 2019 2020 [2019, 2020]

132480 rows × 6 columns

# Explode the Year list into one row per year and drop the now unnecessary start and end year columns
# Drop any rows where there is no estimate. If there is no data, there is no reason for it to be in the database.
final = frames2.explode('Year').drop(columns=['Start_Year','End_Year'])
final = final.dropna(subset = ['ESTIMATE'])

final
ZCTA5 ESTIMATE MARGIN_OF_ERROR Year
0 00601 95100.0 20917.0 2000
0 00601 95100.0 20917.0 2001
0 00601 95100.0 20917.0 2002
0 00601 95100.0 20917.0 2003
0 00601 95100.0 20917.0 2004
... ... ... ... ...
33119 99929 178800.0 30051.0 2014
33119 99929 196900.0 52061.0 2015
33119 99929 196900.0 52061.0 2016
33119 99929 196900.0 52061.0 2017
33119 99929 196900.0 52061.0 2018

501841 rows × 4 columns
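
With the cleaning done, we can export the table, as promised in the introduction. A sketch (the path and filename are just examples; when re-importing, remember to read ZCTA5 as text to keep the leading zeros):

# Write the final table to CSV (or Excel) for upload to the RDBMS
final.to_csv("contrib_data/median_house_value.csv", index=False)
# final.to_excel("contrib_data/median_house_value.xlsx", index=False)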

Next Steps - Zip to ZCTA#

Now that we have our data in a format that matches some of the other tables in the database, we need to be able to match this ZCTA level table to Zip Code level tables. To do this, we will use the Zip to ZCTA Crosswalk provided at the link below. This will allow us to write a GROUP BY in a SQL query that converts Zip level data to ZCTA level data, which can then be joined to our ZCTA level table. The crosswalk also has the added benefit of including the City and State name for each Zip.

https://udsmapper.org/zip-code-to-zcta-crosswalk/

import pandas as pd
import numpy as np

# Specify each column as str on import or it will drop the leading zeros that we need.
df = pd.read_excel("contrib_data/ZIPCodetoZCTACrosswalk2021UDS.xlsx", dtype = {'ZIP_CODE': str, 'PO_NAME': str, 'STATE': str, 'ZIP_TYPE': str, 'ZCTA': str, 'zip_join_type': str})

df
ZIP_CODE PO_NAME STATE ZIP_TYPE ZCTA zip_join_type
0 00501 Holtsville NY Post Office or large volume customer 11742 Spatial join to ZCTA
1 00544 Holtsville NY Post Office or large volume customer 11742 Spatial join to ZCTA
2 00601 Adjuntas PR Zip Code Area 00601 Zip matches ZCTA
3 00602 Aguada PR Zip Code Area 00602 Zip matches ZCTA
4 00603 Aguadilla PR Zip Code Area 00603 Zip matches ZCTA
... ... ... ... ... ... ...
41086 99926 Metlakatla AK Zip Code Area 99926 Zip matches ZCTA
41087 99927 Point Baker AK Zip Code Area 99927 Zip matches ZCTA
41088 99928 Ward Cove AK Post Office or large volume customer 99901 Spatial join to ZCTA
41089 99929 Wrangell AK Zip Code Area 99929 Zip matches ZCTA
41090 99950 Ketchikan AK Zip Code Area 99921 Spatial join to ZCTA

41091 rows × 6 columns
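
The crosswalk can be exported the same way so both tables are ready for upload (again, the filename is just an example):

# Export the crosswalk alongside the housing table
df.to_csv("contrib_data/zip_to_zcta.csv", index=False)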

Ready to Import to Database#

Both tables are now ready to import into the database. Refer to the P4Analytics page for the different methods to connect and import data. One possible method is sketched below.
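
For example, one common approach is DataFrame.to_sql (a sketch using SQLAlchemy, which to_sql expects; the connection string values below are placeholders, not the real server details):

from sqlalchemy import create_engine

# Placeholder credentials -- substitute your own, per the P4Analytics page
engine = create_engine(
    "mssql+pyodbc://your_user_id:your_password@the_server/the_database"
    "?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes"
)

# Append the cleaned dataframe into the existing table
final.to_sql("MEDIAN_HOUSE_VALUE", engine, if_exists="append", index=False)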

Get Querying#

The following query is an example of how we might use some of this data. Let's say we want to look at the average income and the median home value in all Little Rock, AR Zips for 2018. Our income data is at the Zip Code level and the median housing data is at the ZCTA level. Because of these two different levels, we will first need to group the Zip Code level income data into ZCTAs. We can then use this income-by-ZCTA table as a subquery and join it to the ZCTA median housing data. Because we know that a ZCTA is associated with its most frequent Zip, we can also join back to the Zip to ZCTA conversion table to get the City and State of that Zip/ZCTA and filter the results.

import pandas as pd
import pypyodbc as pyodbc
# Change the values below to your personal and server credentials

svr = 'the_server'
uid = 'your_user_id'
pwd = 'your_password'
db = 'the_database'

cnxn = pyodbc.connect('DRIVER={ODBC Driver 18 for SQL Server};SERVER='+svr+';DATABASE='+db+';UID='+uid+';PWD='+ pwd+';TrustServerCertificate=yes;')

query = """select d.PO_NAME as City
    , d.STATE
    , INCOME.*
    , c.ESTIMATE as MEDIAN_HOUSING_ESTIMATE
    , c.ESTIMATE - c.MARGIN_OF_ERROR as MEDIAN_HOUSE_LB
    , c.ESTIMATE + c.MARGIN_OF_ERROR as MEDIAN_HOUSE_UB
from (select b.ZCTA
        , sum(a.Number_Of_Returns) as NBR_RTRNS
        , floor(avg(a.Total_Income_Amount)) as AVG_INCOME
        , a.Date_Year
      from IRS_ZIPCODE_SOURCE a
      join ZIP_TO_ZCTA b
        on a.Zipcode = b.ZIP_CODE
      group by b.ZCTA, a.Date_Year) as INCOME
join MEDIAN_HOUSE_VALUE c
  on INCOME.ZCTA = c.ZCTA5
 and INCOME.Date_Year = c.Year
join ZIP_TO_ZCTA d
  on c.ZCTA5 = d.ZIP_CODE
where d.STATE = 'AR'
  and d.PO_NAME = 'Little Rock'
  and INCOME.Date_Year = 2018"""

df = pd.read_sql(query, cnxn)
df