Manipulating census data with pandas#

ZCTA5 - Age Data#

For my project, I choose the Zip Code dataset. I found the dataset from the Census Bureau age dataset

For the raw data extraction select the above census bureau link and apply the filter ‘All 5-digit zip code’. Next, select the code ‘S0101AGE AND SEX’ from the vast number of available datasets. The data is available from 2010-2020. In this project we have considered the last 4 years 2017-2020. Download the data by selecting the years. Once download is complete unzip the file and there are three files under the zip. First, the metadata file which has all the information about the columns. Second, table notes which describes about the table and finally the actual data which is a csv file. Inside the csv file there are 33,146 records and 434 columns. For this proect we are selecting the subset of columns which cosniders the total population across the age groups.

Below are the steps I have followed to transform and load the dataset to the sql database:

Import the necessary libraries required:

  1. Pandas - It presents a diverse range of utilities, ranging from parsing multiple file formats to converting an entire data table into a NumPy matrix array

  2. Numpy - NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

  3. Sys - The sys module provides various functions and variables that are used to manipulate different parts of the Python runtime environment.

import pandas as pd
import numpy as np

import sys

Load the csv file as dataframe using read_csv

df = pd.read_csv('contrib_data/Age_Data_Zip.csv') #read the file
df.head(10)  #print first 10 records
Geography Geographic Area Name Estimate!!Total!!Total population Annotation of Estimate!!Total!!Total population Margin of Error!!Total MOE!!Total population Annotation of Margin of Error!!Total MOE!!Total population Estimate!!Total!!Total population!!AGE!!Under 5 years Annotation of Estimate!!Total!!Total population!!AGE!!Under 5 years Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years ... Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years Estimate!!Total!!Total population!!AGE!!80 to 84 years Annotation of Estimate!!Total!!Total population!!AGE!!80 to 84 years Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years Estimate!!Total!!Total population!!AGE!!85 years and over Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over Annotation of Estimate!!Total!!Total population!!AGE!!85 years and over
0 8600000US00601 ZCTA5 00601 17599 NaN 261 NaN 938 NaN 40 NaN ... 96 NaN 349 NaN 83 NaN 328 95 NaN NaN
1 8600000US00602 ZCTA5 00602 39209 NaN 128 NaN 1794 NaN 75 NaN ... 159 NaN 789 NaN 142 NaN 648 164 NaN NaN
2 8600000US00603 ZCTA5 00603 50135 NaN 805 NaN 2469 NaN 92 NaN ... 245 NaN 1315 NaN 229 NaN 727 158 NaN NaN
3 8600000US00606 ZCTA5 00606 6304 NaN 255 NaN 312 NaN 45 NaN ... 64 NaN 74 NaN 48 NaN 174 87 NaN NaN
4 8600000US00610 ZCTA5 00610 27590 NaN 163 NaN 1256 NaN 4 NaN ... 146 NaN 576 NaN 112 NaN 365 115 NaN NaN
5 8600000US00612 ZCTA5 00612 62566 NaN 1480 NaN 3260 NaN 241 NaN ... 319 NaN 1433 NaN 238 NaN 1578 271 NaN NaN
6 8600000US00616 ZCTA5 00616 10687 NaN 1192 NaN 469 NaN 216 NaN ... 106 NaN 326 NaN 160 NaN 320 138 NaN NaN
7 8600000US00617 ZCTA5 00617 24508 NaN 296 NaN 1340 NaN 25 NaN ... 188 NaN 450 NaN 159 NaN 503 177 NaN NaN
8 8600000US00622 ZCTA5 00622 7405 NaN 1118 NaN 224 NaN 140 NaN ... 150 NaN 334 NaN 151 NaN 241 129 NaN NaN
9 8600000US00623 ZCTA5 00623 42321 NaN 1118 NaN 1945 NaN 140 NaN ... 266 NaN 1184 NaN 213 NaN 737 189 NaN NaN

10 rows × 78 columns

Select the necessary columns from the dataset using iloc -> Integer-location based indexing for selection by position.

df1 = df.iloc[:,[1,2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74]] 

df1.head(10) #print first 10 records
Geographic Area Name Estimate!!Total!!Total population Estimate!!Total!!Total population!!AGE!!Under 5 years Estimate!!Total!!Total population!!AGE!!5 to 9 years Estimate!!Total!!Total population!!AGE!!10 to 14 years Estimate!!Total!!Total population!!AGE!!15 to 19 years Estimate!!Total!!Total population!!AGE!!20 to 24 years Estimate!!Total!!Total population!!AGE!!25 to 29 years Estimate!!Total!!Total population!!AGE!!30 to 34 years Estimate!!Total!!Total population!!AGE!!35 to 39 years Estimate!!Total!!Total population!!AGE!!40 to 44 years Estimate!!Total!!Total population!!AGE!!45 to 49 years Estimate!!Total!!Total population!!AGE!!50 to 54 years Estimate!!Total!!Total population!!AGE!!55 to 59 years Estimate!!Total!!Total population!!AGE!!60 to 64 years Estimate!!Total!!Total population!!AGE!!65 to 69 years Estimate!!Total!!Total population!!AGE!!70 to 74 years Estimate!!Total!!Total population!!AGE!!75 to 79 years Estimate!!Total!!Total population!!AGE!!80 to 84 years Estimate!!Total!!Total population!!AGE!!85 years and over
0 ZCTA5 00601 17599 938 1046 1278 1250 1218 1148 1073 1121 918 1147 1257 1215 1093 1080 616 524 349 328
1 ZCTA5 00602 39209 1794 2320 2113 2706 2751 2537 2350 2715 2227 2749 2942 2833 2810 2181 1753 991 789 648
2 ZCTA5 00603 50135 2469 2778 3260 3443 3194 3264 3021 3296 3099 3070 3336 2982 3323 3306 2489 1763 1315 727
3 ZCTA5 00606 6304 312 304 420 479 443 302 329 279 557 457 483 317 555 345 303 171 74 174
4 ZCTA5 00610 27590 1256 1346 1895 1939 1924 1624 1575 1638 1944 1937 1910 1748 1744 1629 1400 1140 576 365
5 ZCTA5 00612 62566 3260 3077 4249 4397 4346 3621 3586 3550 4225 4312 4021 3607 3903 3812 3038 2551 1433 1578
6 ZCTA5 00616 10687 469 417 405 678 901 784 574 661 687 877 787 558 533 753 712 245 326 320
7 ZCTA5 00617 24508 1340 1401 1702 1686 1709 1607 1570 1719 1766 1584 1434 1195 1521 1390 1117 814 450 503
8 ZCTA5 00622 7405 224 296 489 477 428 243 352 382 436 477 579 604 365 519 588 371 334 241
9 ZCTA5 00623 42321 1945 2599 2457 2906 2795 2528 2279 2790 3075 3050 2580 2298 2596 2652 2073 1777 1184 737

Replace the column values which has ‘ZCTA5 00001’ to 00001 using replace function.

df2 = df1.replace('ZCTA5 ','', regex=True) 

df2
Geographic Area Name Estimate!!Total!!Total population Estimate!!Total!!Total population!!AGE!!Under 5 years Estimate!!Total!!Total population!!AGE!!5 to 9 years Estimate!!Total!!Total population!!AGE!!10 to 14 years Estimate!!Total!!Total population!!AGE!!15 to 19 years Estimate!!Total!!Total population!!AGE!!20 to 24 years Estimate!!Total!!Total population!!AGE!!25 to 29 years Estimate!!Total!!Total population!!AGE!!30 to 34 years Estimate!!Total!!Total population!!AGE!!35 to 39 years Estimate!!Total!!Total population!!AGE!!40 to 44 years Estimate!!Total!!Total population!!AGE!!45 to 49 years Estimate!!Total!!Total population!!AGE!!50 to 54 years Estimate!!Total!!Total population!!AGE!!55 to 59 years Estimate!!Total!!Total population!!AGE!!60 to 64 years Estimate!!Total!!Total population!!AGE!!65 to 69 years Estimate!!Total!!Total population!!AGE!!70 to 74 years Estimate!!Total!!Total population!!AGE!!75 to 79 years Estimate!!Total!!Total population!!AGE!!80 to 84 years Estimate!!Total!!Total population!!AGE!!85 years and over
0 00601 17599 938 1046 1278 1250 1218 1148 1073 1121 918 1147 1257 1215 1093 1080 616 524 349 328
1 00602 39209 1794 2320 2113 2706 2751 2537 2350 2715 2227 2749 2942 2833 2810 2181 1753 991 789 648
2 00603 50135 2469 2778 3260 3443 3194 3264 3021 3296 3099 3070 3336 2982 3323 3306 2489 1763 1315 727
3 00606 6304 312 304 420 479 443 302 329 279 557 457 483 317 555 345 303 171 74 174
4 00610 27590 1256 1346 1895 1939 1924 1624 1575 1638 1944 1937 1910 1748 1744 1629 1400 1140 576 365
5 00612 62566 3260 3077 4249 4397 4346 3621 3586 3550 4225 4312 4021 3607 3903 3812 3038 2551 1433 1578
6 00616 10687 469 417 405 678 901 784 574 661 687 877 787 558 533 753 712 245 326 320
7 00617 24508 1340 1401 1702 1686 1709 1607 1570 1719 1766 1584 1434 1195 1521 1390 1117 814 450 503
8 00622 7405 224 296 489 477 428 243 352 382 436 477 579 604 365 519 588 371 334 241
9 00623 42321 1945 2599 2457 2906 2795 2528 2279 2790 3075 3050 2580 2298 2596 2652 2073 1777 1184 737
10 00624 23238 1396 1604 1562 1609 1684 1651 1399 1653 1108 1477 1504 1602 1413 1103 1079 806 363 225
11 00627 32936 1557 1770 2109 2333 2377 2024 1886 2366 1949 2320 2269 2051 2102 2022 1540 1109 410 742
12 00631 1555 34 54 54 75 133 116 103 56 105 78 105 110 138 114 87 68 68 57
13 00637 23610 1210 1199 1629 1665 1607 1368 1309 1384 1576 1461 1424 1501 1432 1329 1432 1056 527 501
14 00638 17202 1025 1170 993 1076 1232 1077 1041 930 1042 993 1178 1086 1305 991 772 552 497 242
15 00641 28332 1359 1546 1760 1975 2037 1832 1532 1579 1645 1766 1985 1937 2080 1717 1358 1092 494 638
16 00646 38008 2033 2570 2658 2730 2448 2078 2240 2610 3112 2767 2477 2020 2349 2258 1405 913 843 497
17 00647 5610 226 204 264 357 411 241 406 232 280 403 398 332 423 502 357 285 117 172
18 00650 14664 848 886 939 1175 1147 819 847 903 963 1058 915 715 876 851 692 460 330 240

Rename the column names to the standard names using rename function.

df2.rename(columns={'Geographic Area Name':'ZCTA5','Estimate!!Total!!Total population':'total_population','Estimate!!Total!!Total population!!AGE!!Under 5 years':'age_5under','Estimate!!Total!!Total population!!AGE!!5 to 9 years':'age_5to9y','Estimate!!Total!!Total population!!AGE!!10 to 14 years':'age_10to14y','Estimate!!Total!!Total population!!AGE!!15 to 19 years':'age_15to19y'}, inplace=True)

df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!20 to 24 years':'age_20to24y','Estimate!!Total!!Total population!!AGE!!25 to 29 years':'age_25to29y','Estimate!!Total!!Total population!!AGE!!30 to 34 years':'age_30to34y','Estimate!!Total!!Total population!!AGE!!35 to 39 years':'age_35to39y','Estimate!!Total!!Total population!!AGE!!40 to 44 years':'age_40to44y','Estimate!!Total!!Total population!!AGE!!45 to 49 years':'age_45to49y'}, inplace=True)

df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!50 to 54 years':'age_50to54y','Estimate!!Total!!Total population!!AGE!!55 to 59 years':'age_55to59y','Estimate!!Total!!Total population!!AGE!!60 to 64 years':'age_60to64y','Estimate!!Total!!Total population!!AGE!!65 to 69 years':'age_65to69y','Estimate!!Total!!Total population!!AGE!!70 to 74 years':'age_70to74y','Estimate!!Total!!Total population!!AGE!!75 to 79 years':'age_75to79y','Estimate!!Total!!Total population!!AGE!!80 to 84 years':'age_80to84y','Estimate!!Total!!Total population!!AGE!!85 years and over':'age_above85y'}, inplace=True)

Drop the duplicates from the dataframe using drop_duplicates function

df3 = df2.drop_duplicates()

df3
ZCTA5 total_population age_5under age_5to9y age_10to14y age_15to19y age_20to24y age_25to29y age_30to34y age_35to39y age_40to44y age_45to49y age_50to54y age_55to59y age_60to64y age_65to69y age_70to74y age_75to79y age_80to84y age_above85y
0 00601 17599 938 1046 1278 1250 1218 1148 1073 1121 918 1147 1257 1215 1093 1080 616 524 349 328
1 00602 39209 1794 2320 2113 2706 2751 2537 2350 2715 2227 2749 2942 2833 2810 2181 1753 991 789 648
2 00603 50135 2469 2778 3260 3443 3194 3264 3021 3296 3099 3070 3336 2982 3323 3306 2489 1763 1315 727
3 00606 6304 312 304 420 479 443 302 329 279 557 457 483 317 555 345 303 171 74 174
4 00610 27590 1256 1346 1895 1939 1924 1624 1575 1638 1944 1937 1910 1748 1744 1629 1400 1140 576 365
5 00612 62566 3260 3077 4249 4397 4346 3621 3586 3550 4225 4312 4021 3607 3903 3812 3038 2551 1433 1578
6 00616 10687 469 417 405 678 901 784 574 661 687 877 787 558 533 753 712 245 326 320
7 00617 24508 1340 1401 1702 1686 1709 1607 1570 1719 1766 1584 1434 1195 1521 1390 1117 814 450 503
8 00622 7405 224 296 489 477 428 243 352 382 436 477 579 604 365 519 588 371 334 241
9 00623 42321 1945 2599 2457 2906 2795 2528 2279 2790 3075 3050 2580 2298 2596 2652 2073 1777 1184 737
10 00624 23238 1396 1604 1562 1609 1684 1651 1399 1653 1108 1477 1504 1602 1413 1103 1079 806 363 225
11 00627 32936 1557 1770 2109 2333 2377 2024 1886 2366 1949 2320 2269 2051 2102 2022 1540 1109 410 742
12 00631 1555 34 54 54 75 133 116 103 56 105 78 105 110 138 114 87 68 68 57
13 00637 23610 1210 1199 1629 1665 1607 1368 1309 1384 1576 1461 1424 1501 1432 1329 1432 1056 527 501
14 00638 17202 1025 1170 993 1076 1232 1077 1041 930 1042 993 1178 1086 1305 991 772 552 497 242
15 00641 28332 1359 1546 1760 1975 2037 1832 1532 1579 1645 1766 1985 1937 2080 1717 1358 1092 494 638
16 00646 38008 2033 2570 2658 2730 2448 2078 2240 2610 3112 2767 2477 2020 2349 2258 1405 913 843 497
17 00647 5610 226 204 264 357 411 241 406 232 280 403 398 332 423 502 357 285 117 172
18 00650 14664 848 886 939 1175 1147 819 847 903 963 1058 915 715 876 851 692 460 330 240

Add new column to the dataframe using insert function for the year which we are loading data.

df3.insert(1, "Year", "2020", True)
df3
ZCTA5 Year total_population age_5under age_5to9y age_10to14y age_15to19y age_20to24y age_25to29y age_30to34y ... age_40to44y age_45to49y age_50to54y age_55to59y age_60to64y age_65to69y age_70to74y age_75to79y age_80to84y age_above85y
0 00601 2020 17599 938 1046 1278 1250 1218 1148 1073 ... 918 1147 1257 1215 1093 1080 616 524 349 328
1 00602 2020 39209 1794 2320 2113 2706 2751 2537 2350 ... 2227 2749 2942 2833 2810 2181 1753 991 789 648
2 00603 2020 50135 2469 2778 3260 3443 3194 3264 3021 ... 3099 3070 3336 2982 3323 3306 2489 1763 1315 727
3 00606 2020 6304 312 304 420 479 443 302 329 ... 557 457 483 317 555 345 303 171 74 174
4 00610 2020 27590 1256 1346 1895 1939 1924 1624 1575 ... 1944 1937 1910 1748 1744 1629 1400 1140 576 365
5 00612 2020 62566 3260 3077 4249 4397 4346 3621 3586 ... 4225 4312 4021 3607 3903 3812 3038 2551 1433 1578
6 00616 2020 10687 469 417 405 678 901 784 574 ... 687 877 787 558 533 753 712 245 326 320
7 00617 2020 24508 1340 1401 1702 1686 1709 1607 1570 ... 1766 1584 1434 1195 1521 1390 1117 814 450 503
8 00622 2020 7405 224 296 489 477 428 243 352 ... 436 477 579 604 365 519 588 371 334 241
9 00623 2020 42321 1945 2599 2457 2906 2795 2528 2279 ... 3075 3050 2580 2298 2596 2652 2073 1777 1184 737
10 00624 2020 23238 1396 1604 1562 1609 1684 1651 1399 ... 1108 1477 1504 1602 1413 1103 1079 806 363 225
11 00627 2020 32936 1557 1770 2109 2333 2377 2024 1886 ... 1949 2320 2269 2051 2102 2022 1540 1109 410 742
12 00631 2020 1555 34 54 54 75 133 116 103 ... 105 78 105 110 138 114 87 68 68 57
13 00637 2020 23610 1210 1199 1629 1665 1607 1368 1309 ... 1576 1461 1424 1501 1432 1329 1432 1056 527 501
14 00638 2020 17202 1025 1170 993 1076 1232 1077 1041 ... 1042 993 1178 1086 1305 991 772 552 497 242
15 00641 2020 28332 1359 1546 1760 1975 2037 1832 1532 ... 1645 1766 1985 1937 2080 1717 1358 1092 494 638
16 00646 2020 38008 2033 2570 2658 2730 2448 2078 2240 ... 3112 2767 2477 2020 2349 2258 1405 913 843 497
17 00647 2020 5610 226 204 264 357 411 241 406 ... 280 403 398 332 423 502 357 285 117 172
18 00650 2020 14664 848 886 939 1175 1147 819 847 ... 963 1058 915 715 876 851 692 460 330 240

19 rows × 21 columns

If there are files for each year then oncatenate the data for all these years using concat function. Here I just used df3 as I showed only for one year, similarly we can create multiple dataframes for each year and concatenate using the below function

dataframes = [df3]
  
result = pd.concat(dataframes)

Load the result to the sql server database#

Now, connect to the sql and load the dataframe as new table using SQLAlchemy. It is a sql toolkit to connect and perform all the necessary sql operations from python.

svr is the sql server name
uid is the user id
pwd is the password
db is the database which is USZIPCODE
create_engine will establish python jdbc to the sql

from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'XXXXXX'
pwd = 'XXXXXX'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
result.to_sql('AGE',engine, if_exists='replace')

The above peice of code will create a new table called ‘AGE’ in the ‘USZIPCODE’ database. The table will have all the columns which are present in the results dataframe.

We can validate the data by logging into the sql server directly or we can read the table using read_sql as shown below:

df = pd.read_sql('select * from AGE', engine)
print('***Age data for the ZCTA5***')
print(df)

Use cases#

There are many use cases which can be derived from the age group dataset. Below are few use cases. For all the use cases below, please update the following fields in the below python code before executing

  1. uid - user id

  2. pwd - password

  3. SQL Query - the sql query which needs to be executed

from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'xxx'
pwd = 'xxx'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
df = pd.read_sql('SQL Query', engine)
print(df)

Use case 1:#

Find the total population of all age groups across all years and ZCTA5

select year, sum(total_population) as total_population 
from [USZIPCODE].[dbo].[AGE] 
group by year;

Use case 2:#

Top five ZCTA5 codes where the population is highest

select top(5) zcta5,total_population,year 
from [USZIPCODE].[dbo].[AGE] 
where year=2020 
order by total_population desc;

Use case 3:#

ZCTA5 with the highest teenage population for a given year

select top(1) ZCTA5, total 
from (  select ZCTA5,sum(age_10to14y + age_15to19y) as total 
        from [USZIPCODE].[dbo].[AGE]
        where year=2020 
        group by ZCTA5)
order by total desc;

Use case 4:#

Population growth across years

select year,total_population, total_population-LAG(total_population) OVER (ORDER BY year) AS population_diff  
from (  select year, sum(total_population) as total_population 
        from [USZIPCODE].[dbo].[AGE] 
        group by year);