Manipulating census data with pandas#

ZCTA5 - Age Data#

For my project, I choose the Zip Code dataset. I found the dataset from the Census Bureau age dataset

For the raw data extraction select the above census bureau link and apply the filter ‘All 5-digit zip code’. Next, select the code ‘S0101AGE AND SEX’ from the vast number of available datasets. The data is available from 2010-2020. In this project we have considered the last 4 years 2017-2020. Download the data by selecting the years. Once download is complete unzip the file and there are three files under the zip. First, the metadata file which has all the information about the columns. Second, table notes which describes about the table and finally the actual data which is a csv file. Inside the csv file there are 33,146 records and 434 columns. For this proect we are selecting the subset of columns which cosniders the total population across the age groups.

Below are the steps I have followed to transform and load the dataset to the sql database:

Import the necessary libraries required:

Pandas - It presents a diverse range of utilities, ranging from parsing multiple file formats to converting an entire data table into a NumPy matrix array
Numpy - NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
Sys - The sys module provides various functions and variables that are used to manipulate different parts of the Python runtime environment.

import pandas as pd
import numpy as np

import sys

Load the csv file as dataframe using read_csv

df = pd.read_csv('contrib_data/Age_Data_Zip.csv') #read the file
df.head(10)  #print first 10 records

	Geography	Geographic Area Name	Estimate!!Total!!Total population	Annotation of Estimate!!Total!!Total population	Margin of Error!!Total MOE!!Total population	Annotation of Margin of Error!!Total MOE!!Total population	Estimate!!Total!!Total population!!AGE!!Under 5 years	Annotation of Estimate!!Total!!Total population!!AGE!!Under 5 years	Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years	Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years	...	Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years	Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years	Estimate!!Total!!Total population!!AGE!!80 to 84 years	Annotation of Estimate!!Total!!Total population!!AGE!!80 to 84 years	Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years	Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years	Estimate!!Total!!Total population!!AGE!!85 years and over	Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over	Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over	Annotation of Estimate!!Total!!Total population!!AGE!!85 years and over
0	8600000US00601	ZCTA5 00601	17599	NaN	261	NaN	938	NaN	40	NaN	...	96	NaN	349	NaN	83	NaN	328	95	NaN	NaN
1	8600000US00602	ZCTA5 00602	39209	NaN	128	NaN	1794	NaN	75	NaN	...	159	NaN	789	NaN	142	NaN	648	164	NaN	NaN
2	8600000US00603	ZCTA5 00603	50135	NaN	805	NaN	2469	NaN	92	NaN	...	245	NaN	1315	NaN	229	NaN	727	158	NaN	NaN
3	8600000US00606	ZCTA5 00606	6304	NaN	255	NaN	312	NaN	45	NaN	...	64	NaN	74	NaN	48	NaN	174	87	NaN	NaN
4	8600000US00610	ZCTA5 00610	27590	NaN	163	NaN	1256	NaN	4	NaN	...	146	NaN	576	NaN	112	NaN	365	115	NaN	NaN
5	8600000US00612	ZCTA5 00612	62566	NaN	1480	NaN	3260	NaN	241	NaN	...	319	NaN	1433	NaN	238	NaN	1578	271	NaN	NaN
6	8600000US00616	ZCTA5 00616	10687	NaN	1192	NaN	469	NaN	216	NaN	...	106	NaN	326	NaN	160	NaN	320	138	NaN	NaN
7	8600000US00617	ZCTA5 00617	24508	NaN	296	NaN	1340	NaN	25	NaN	...	188	NaN	450	NaN	159	NaN	503	177	NaN	NaN
8	8600000US00622	ZCTA5 00622	7405	NaN	1118	NaN	224	NaN	140	NaN	...	150	NaN	334	NaN	151	NaN	241	129	NaN	NaN
9	8600000US00623	ZCTA5 00623	42321	NaN	1118	NaN	1945	NaN	140	NaN	...	266	NaN	1184	NaN	213	NaN	737	189	NaN	NaN

10 rows × 78 columns

Select the necessary columns from the dataset using iloc -> Integer-location based indexing for selection by position.

df1 = df.iloc[:,[1,2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74]] 

df1.head(10) #print first 10 records

	Geographic Area Name	Estimate!!Total!!Total population	Estimate!!Total!!Total population!!AGE!!Under 5 years	Estimate!!Total!!Total population!!AGE!!5 to 9 years	Estimate!!Total!!Total population!!AGE!!10 to 14 years	Estimate!!Total!!Total population!!AGE!!15 to 19 years	Estimate!!Total!!Total population!!AGE!!20 to 24 years	Estimate!!Total!!Total population!!AGE!!25 to 29 years	Estimate!!Total!!Total population!!AGE!!30 to 34 years	Estimate!!Total!!Total population!!AGE!!35 to 39 years	Estimate!!Total!!Total population!!AGE!!40 to 44 years	Estimate!!Total!!Total population!!AGE!!45 to 49 years	Estimate!!Total!!Total population!!AGE!!50 to 54 years	Estimate!!Total!!Total population!!AGE!!55 to 59 years	Estimate!!Total!!Total population!!AGE!!60 to 64 years	Estimate!!Total!!Total population!!AGE!!65 to 69 years	Estimate!!Total!!Total population!!AGE!!70 to 74 years	Estimate!!Total!!Total population!!AGE!!75 to 79 years	Estimate!!Total!!Total population!!AGE!!80 to 84 years	Estimate!!Total!!Total population!!AGE!!85 years and over
0	ZCTA5 00601	17599	938	1046	1278	1250	1218	1148	1073	1121	918	1147	1257	1215	1093	1080	616	524	349	328
1	ZCTA5 00602	39209	1794	2320	2113	2706	2751	2537	2350	2715	2227	2749	2942	2833	2810	2181	1753	991	789	648
2	ZCTA5 00603	50135	2469	2778	3260	3443	3194	3264	3021	3296	3099	3070	3336	2982	3323	3306	2489	1763	1315	727
3	ZCTA5 00606	6304	312	304	420	479	443	302	329	279	557	457	483	317	555	345	303	171	74	174
4	ZCTA5 00610	27590	1256	1346	1895	1939	1924	1624	1575	1638	1944	1937	1910	1748	1744	1629	1400	1140	576	365
5	ZCTA5 00612	62566	3260	3077	4249	4397	4346	3621	3586	3550	4225	4312	4021	3607	3903	3812	3038	2551	1433	1578
6	ZCTA5 00616	10687	469	417	405	678	901	784	574	661	687	877	787	558	533	753	712	245	326	320
7	ZCTA5 00617	24508	1340	1401	1702	1686	1709	1607	1570	1719	1766	1584	1434	1195	1521	1390	1117	814	450	503
8	ZCTA5 00622	7405	224	296	489	477	428	243	352	382	436	477	579	604	365	519	588	371	334	241
9	ZCTA5 00623	42321	1945	2599	2457	2906	2795	2528	2279	2790	3075	3050	2580	2298	2596	2652	2073	1777	1184	737

Replace the column values which has ‘ZCTA5 00001’ to 00001 using replace function.

df2 = df1.replace('ZCTA5 ','', regex=True) 

df2

	Geographic Area Name	Estimate!!Total!!Total population	Estimate!!Total!!Total population!!AGE!!Under 5 years	Estimate!!Total!!Total population!!AGE!!5 to 9 years	Estimate!!Total!!Total population!!AGE!!10 to 14 years	Estimate!!Total!!Total population!!AGE!!15 to 19 years	Estimate!!Total!!Total population!!AGE!!20 to 24 years	Estimate!!Total!!Total population!!AGE!!25 to 29 years	Estimate!!Total!!Total population!!AGE!!30 to 34 years	Estimate!!Total!!Total population!!AGE!!35 to 39 years	Estimate!!Total!!Total population!!AGE!!40 to 44 years	Estimate!!Total!!Total population!!AGE!!45 to 49 years	Estimate!!Total!!Total population!!AGE!!50 to 54 years	Estimate!!Total!!Total population!!AGE!!55 to 59 years	Estimate!!Total!!Total population!!AGE!!60 to 64 years	Estimate!!Total!!Total population!!AGE!!65 to 69 years	Estimate!!Total!!Total population!!AGE!!70 to 74 years	Estimate!!Total!!Total population!!AGE!!75 to 79 years	Estimate!!Total!!Total population!!AGE!!80 to 84 years	Estimate!!Total!!Total population!!AGE!!85 years and over
0	00601	17599	938	1046	1278	1250	1218	1148	1073	1121	918	1147	1257	1215	1093	1080	616	524	349	328
1	00602	39209	1794	2320	2113	2706	2751	2537	2350	2715	2227	2749	2942	2833	2810	2181	1753	991	789	648
2	00603	50135	2469	2778	3260	3443	3194	3264	3021	3296	3099	3070	3336	2982	3323	3306	2489	1763	1315	727
3	00606	6304	312	304	420	479	443	302	329	279	557	457	483	317	555	345	303	171	74	174
4	00610	27590	1256	1346	1895	1939	1924	1624	1575	1638	1944	1937	1910	1748	1744	1629	1400	1140	576	365
5	00612	62566	3260	3077	4249	4397	4346	3621	3586	3550	4225	4312	4021	3607	3903	3812	3038	2551	1433	1578
6	00616	10687	469	417	405	678	901	784	574	661	687	877	787	558	533	753	712	245	326	320
7	00617	24508	1340	1401	1702	1686	1709	1607	1570	1719	1766	1584	1434	1195	1521	1390	1117	814	450	503
8	00622	7405	224	296	489	477	428	243	352	382	436	477	579	604	365	519	588	371	334	241
9	00623	42321	1945	2599	2457	2906	2795	2528	2279	2790	3075	3050	2580	2298	2596	2652	2073	1777	1184	737
10	00624	23238	1396	1604	1562	1609	1684	1651	1399	1653	1108	1477	1504	1602	1413	1103	1079	806	363	225
11	00627	32936	1557	1770	2109	2333	2377	2024	1886	2366	1949	2320	2269	2051	2102	2022	1540	1109	410	742
12	00631	1555	34	54	54	75	133	116	103	56	105	78	105	110	138	114	87	68	68	57
13	00637	23610	1210	1199	1629	1665	1607	1368	1309	1384	1576	1461	1424	1501	1432	1329	1432	1056	527	501
14	00638	17202	1025	1170	993	1076	1232	1077	1041	930	1042	993	1178	1086	1305	991	772	552	497	242
15	00641	28332	1359	1546	1760	1975	2037	1832	1532	1579	1645	1766	1985	1937	2080	1717	1358	1092	494	638
16	00646	38008	2033	2570	2658	2730	2448	2078	2240	2610	3112	2767	2477	2020	2349	2258	1405	913	843	497
17	00647	5610	226	204	264	357	411	241	406	232	280	403	398	332	423	502	357	285	117	172
18	00650	14664	848	886	939	1175	1147	819	847	903	963	1058	915	715	876	851	692	460	330	240

Rename the column names to the standard names using rename function.

df2.rename(columns={'Geographic Area Name':'ZCTA5','Estimate!!Total!!Total population':'total_population','Estimate!!Total!!Total population!!AGE!!Under 5 years':'age_5under','Estimate!!Total!!Total population!!AGE!!5 to 9 years':'age_5to9y','Estimate!!Total!!Total population!!AGE!!10 to 14 years':'age_10to14y','Estimate!!Total!!Total population!!AGE!!15 to 19 years':'age_15to19y'}, inplace=True)

df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!20 to 24 years':'age_20to24y','Estimate!!Total!!Total population!!AGE!!25 to 29 years':'age_25to29y','Estimate!!Total!!Total population!!AGE!!30 to 34 years':'age_30to34y','Estimate!!Total!!Total population!!AGE!!35 to 39 years':'age_35to39y','Estimate!!Total!!Total population!!AGE!!40 to 44 years':'age_40to44y','Estimate!!Total!!Total population!!AGE!!45 to 49 years':'age_45to49y'}, inplace=True)

df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!50 to 54 years':'age_50to54y','Estimate!!Total!!Total population!!AGE!!55 to 59 years':'age_55to59y','Estimate!!Total!!Total population!!AGE!!60 to 64 years':'age_60to64y','Estimate!!Total!!Total population!!AGE!!65 to 69 years':'age_65to69y','Estimate!!Total!!Total population!!AGE!!70 to 74 years':'age_70to74y','Estimate!!Total!!Total population!!AGE!!75 to 79 years':'age_75to79y','Estimate!!Total!!Total population!!AGE!!80 to 84 years':'age_80to84y','Estimate!!Total!!Total population!!AGE!!85 years and over':'age_above85y'}, inplace=True)

Drop the duplicates from the dataframe using drop_duplicates function

df3 = df2.drop_duplicates()

df3

	ZCTA5	total_population	age_5under	age_5to9y	age_10to14y	age_15to19y	age_20to24y	age_25to29y	age_30to34y	age_35to39y	age_40to44y	age_45to49y	age_50to54y	age_55to59y	age_60to64y	age_65to69y	age_70to74y	age_75to79y	age_80to84y	age_above85y
0	00601	17599	938	1046	1278	1250	1218	1148	1073	1121	918	1147	1257	1215	1093	1080	616	524	349	328
1	00602	39209	1794	2320	2113	2706	2751	2537	2350	2715	2227	2749	2942	2833	2810	2181	1753	991	789	648
2	00603	50135	2469	2778	3260	3443	3194	3264	3021	3296	3099	3070	3336	2982	3323	3306	2489	1763	1315	727
3	00606	6304	312	304	420	479	443	302	329	279	557	457	483	317	555	345	303	171	74	174
4	00610	27590	1256	1346	1895	1939	1924	1624	1575	1638	1944	1937	1910	1748	1744	1629	1400	1140	576	365
5	00612	62566	3260	3077	4249	4397	4346	3621	3586	3550	4225	4312	4021	3607	3903	3812	3038	2551	1433	1578
6	00616	10687	469	417	405	678	901	784	574	661	687	877	787	558	533	753	712	245	326	320
7	00617	24508	1340	1401	1702	1686	1709	1607	1570	1719	1766	1584	1434	1195	1521	1390	1117	814	450	503
8	00622	7405	224	296	489	477	428	243	352	382	436	477	579	604	365	519	588	371	334	241
9	00623	42321	1945	2599	2457	2906	2795	2528	2279	2790	3075	3050	2580	2298	2596	2652	2073	1777	1184	737
10	00624	23238	1396	1604	1562	1609	1684	1651	1399	1653	1108	1477	1504	1602	1413	1103	1079	806	363	225
11	00627	32936	1557	1770	2109	2333	2377	2024	1886	2366	1949	2320	2269	2051	2102	2022	1540	1109	410	742
12	00631	1555	34	54	54	75	133	116	103	56	105	78	105	110	138	114	87	68	68	57
13	00637	23610	1210	1199	1629	1665	1607	1368	1309	1384	1576	1461	1424	1501	1432	1329	1432	1056	527	501
14	00638	17202	1025	1170	993	1076	1232	1077	1041	930	1042	993	1178	1086	1305	991	772	552	497	242
15	00641	28332	1359	1546	1760	1975	2037	1832	1532	1579	1645	1766	1985	1937	2080	1717	1358	1092	494	638
16	00646	38008	2033	2570	2658	2730	2448	2078	2240	2610	3112	2767	2477	2020	2349	2258	1405	913	843	497
17	00647	5610	226	204	264	357	411	241	406	232	280	403	398	332	423	502	357	285	117	172
18	00650	14664	848	886	939	1175	1147	819	847	903	963	1058	915	715	876	851	692	460	330	240

Add new column to the dataframe using insert function for the year which we are loading data.

df3.insert(1, "Year", "2020", True)
df3

	ZCTA5	Year	total_population	age_5under	age_5to9y	age_10to14y	age_15to19y	age_20to24y	age_25to29y	age_30to34y	...	age_40to44y	age_45to49y	age_50to54y	age_55to59y	age_60to64y	age_65to69y	age_70to74y	age_75to79y	age_80to84y	age_above85y
0	00601	2020	17599	938	1046	1278	1250	1218	1148	1073	...	918	1147	1257	1215	1093	1080	616	524	349	328
1	00602	2020	39209	1794	2320	2113	2706	2751	2537	2350	...	2227	2749	2942	2833	2810	2181	1753	991	789	648
2	00603	2020	50135	2469	2778	3260	3443	3194	3264	3021	...	3099	3070	3336	2982	3323	3306	2489	1763	1315	727
3	00606	2020	6304	312	304	420	479	443	302	329	...	557	457	483	317	555	345	303	171	74	174
4	00610	2020	27590	1256	1346	1895	1939	1924	1624	1575	...	1944	1937	1910	1748	1744	1629	1400	1140	576	365
5	00612	2020	62566	3260	3077	4249	4397	4346	3621	3586	...	4225	4312	4021	3607	3903	3812	3038	2551	1433	1578
6	00616	2020	10687	469	417	405	678	901	784	574	...	687	877	787	558	533	753	712	245	326	320
7	00617	2020	24508	1340	1401	1702	1686	1709	1607	1570	...	1766	1584	1434	1195	1521	1390	1117	814	450	503
8	00622	2020	7405	224	296	489	477	428	243	352	...	436	477	579	604	365	519	588	371	334	241
9	00623	2020	42321	1945	2599	2457	2906	2795	2528	2279	...	3075	3050	2580	2298	2596	2652	2073	1777	1184	737
10	00624	2020	23238	1396	1604	1562	1609	1684	1651	1399	...	1108	1477	1504	1602	1413	1103	1079	806	363	225
11	00627	2020	32936	1557	1770	2109	2333	2377	2024	1886	...	1949	2320	2269	2051	2102	2022	1540	1109	410	742
12	00631	2020	1555	34	54	54	75	133	116	103	...	105	78	105	110	138	114	87	68	68	57
13	00637	2020	23610	1210	1199	1629	1665	1607	1368	1309	...	1576	1461	1424	1501	1432	1329	1432	1056	527	501
14	00638	2020	17202	1025	1170	993	1076	1232	1077	1041	...	1042	993	1178	1086	1305	991	772	552	497	242
15	00641	2020	28332	1359	1546	1760	1975	2037	1832	1532	...	1645	1766	1985	1937	2080	1717	1358	1092	494	638
16	00646	2020	38008	2033	2570	2658	2730	2448	2078	2240	...	3112	2767	2477	2020	2349	2258	1405	913	843	497
17	00647	2020	5610	226	204	264	357	411	241	406	...	280	403	398	332	423	502	357	285	117	172
18	00650	2020	14664	848	886	939	1175	1147	819	847	...	963	1058	915	715	876	851	692	460	330	240

19 rows × 21 columns

If there are files for each year then oncatenate the data for all these years using concat function. Here I just used df3 as I showed only for one year, similarly we can create multiple dataframes for each year and concatenate using the below function

dataframes = [df3]
  
result = pd.concat(dataframes)

Load the result to the sql server database#

Now, connect to the sql and load the dataframe as new table using SQLAlchemy. It is a sql toolkit to connect and perform all the necessary sql operations from python.

svr is the sql server name
uid is the user id
pwd is the password
db is the database which is USZIPCODE
create_engine will establish python jdbc to the sql

from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'XXXXXX'
pwd = 'XXXXXX'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
result.to_sql('AGE',engine, if_exists='replace')

The above peice of code will create a new table called ‘AGE’ in the ‘USZIPCODE’ database. The table will have all the columns which are present in the results dataframe.

We can validate the data by logging into the sql server directly or we can read the table using read_sql as shown below:

df = pd.read_sql('select * from AGE', engine)
print('***Age data for the ZCTA5***')
print(df)

Use cases#

There are many use cases which can be derived from the age group dataset. Below are few use cases. For all the use cases below, please update the following fields in the below python code before executing

uid - user id
pwd - password
SQL Query - the sql query which needs to be executed

from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'xxx'
pwd = 'xxx'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
df = pd.read_sql('SQL Query', engine)
print(df)

Use case 1:#

Find the total population of all age groups across all years and ZCTA5

select year, sum(total_population) as total_population 
from [USZIPCODE].[dbo].[AGE] 
group by year;

Use case 2:#

Top five ZCTA5 codes where the population is highest

select top(5) zcta5,total_population,year 
from [USZIPCODE].[dbo].[AGE] 
where year=2020 
order by total_population desc;

Use case 3:#

ZCTA5 with the highest teenage population for a given year

select top(1) ZCTA5, total 
from (  select ZCTA5,sum(age_10to14y + age_15to19y) as total 
        from [USZIPCODE].[dbo].[AGE]
        where year=2020 
        group by ZCTA5)
order by total desc;

Use case 4:#

Population growth across years

select year,total_population, total_population-LAG(total_population) OVER (ORDER BY year) AS population_diff  
from (  select year, sum(total_population) as total_population 
        from [USZIPCODE].[dbo].[AGE] 
        group by year);

Python 4 Analytics

Manipulating census data with pandas

Contents