Working with IRS data
Working with IRS data#
The data that I provided for my portion of the Zipcode database is IRS income data. There are two main fields that i focused on providing which are: Total Income Amount and Total Adjusted Gross Income. The IRS provides this data at a zipcode income bucket level. I will be grabbing this data from the IRS site using python and then will roll this data up to a zipcode level throughout several years of data.
My data can be accessed through the SQL tables in our database and can be paired with the other tables that are also at the Zipcode level by joining on Zipcode and Year keys.
Source Table which contains the files as is from the IRS site: IRS_ZIPCODE_SOURCE Cleaned Table which is rolled up to the Zipcode level: IRS_ZIPCODE
A data dictionary is provided at the link below covering the metrics used along with basic collection information. As years are added to the IRS data files the year range inside of the for loop should be increased.
Source Link: https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi
Simple Data Dictionary:
State: State that the data tax data is covering
Zipcode: Zipcode that the IRS data is covering
AGI Stub: These are adjusted gross income braket ranges, The source data comes at the Zip AGI level
N1: This column is the sample size or number of returns inside of the grouping
A00100: This column is the average Adjusted Gross Income for the grouping
A02650: This column is the average Total Income Amount for the grouping
import requests
import pandas as pd
import sqlalchemy
#This function takes a link to the IRS website and a year and pulls the CSV from the IRS site and adds a year column
def getdata(link, year):
df = pd.read_csv(link)
data = df[['STATE', 'zipcode', 'agi_stub', 'N1', 'A00100', 'A02650']]
data = data.rename(columns={"STATE": "State", "agi_stub":"AGI_Stub", "zipcode": "Zipcode", "N1":"Number_Of_Returns", "A00100":"Adjusted_Gross_Income",
"A02650":"Total_Income_Amount"})
data['Date_Year'] = year
return(data)
#This intializes the table that we will store the output of our for loop in
finaltable = pd.DataFrame(columns=['State', 'Zipcode', 'AGI_Stub', 'Number_Of_Returns', 'Adjusted_Gross_Income', 'Total_Income_Amount'])
This for loop will loop through the year ranges using the last two digits of a year and will create the string logic the IRS uses to then call a function to get the csv with the IRS data. Then it appends this data to the table created above.
for i in range(17,19):
year = 2000 + iy
query = 'https://www.irs.gov/pub/irs-soi/' + str(i) + 'zpallagi.csv'
data = getdata(query, year)
finaltable = finaltable.append(data)
# Credentials for the database are stored here which are later used to call the create_engine function
svr = getenv('svr')
db = getenv('db')
uid = getenv('uid')
pwd = getenv('pwd')
The below establishes the connection to our database
engine = sqlalchemy.create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes", fast_executemany=True)
This code takes the above table we created and writes it to the database under the table name “IRS_ZIPCODE_SOURCE”
try:
finaltable.to_sql("IRS_ZIPCODE_SOURCE", engine, if_exists='replace', index=False)
print("Source Table Success")
except:
print("Source Table Load Failure")
This section handles preparing and cleaning the data to be rolled up to the zipcode level. It takes a weighted average of the AGI stubs to produce an average at the zipcode level which will be used in the database.
finaltable['Total_Adjusted_Gross_Income'] = finaltable['Number_Of_Returns'] * finaltable['Adjusted_Gross_Income']
finaltable['Total_Income_Amount'] = finaltable['Number_Of_Returns'] * finaltable['Total_Income_Amount']
data = finaltable
data.drop(["Adjusted_Gross_Income"], axis = 1, inplace=True)
data = data.groupby(['State', 'Zipcode', 'Date_Year'], as_index = False).sum()
data['Average_Adjusted_Gross_Income'] = data['Total_Adjusted_Gross_Income'] / data['Number_Of_Returns']
data['Average_Total_Income_Amount'] = data['Total_Income_Amount'] / data['Number_Of_Returns']
This code takes the above table we created and writes it to the database under the table name “IRS_ZIPCODE”
try:
data.to_sql("IRS_ZIPCODE", engine, if_exists='replace', index=False)
print("Final Table Sucess")
except:
print("Final Table Load Failure")