Manipulating census data with pandas
Contents
Manipulating census data with pandas#
ZCTA5 - Age Data#
For my project, I choose the Zip Code dataset. I found the dataset from the Census Bureau age dataset
For the raw data extraction select the above census bureau link and apply the filter ‘All 5-digit zip code’. Next, select the code ‘S0101AGE AND SEX’ from the vast number of available datasets. The data is available from 2010-2020. In this project we have considered the last 4 years 2017-2020. Download the data by selecting the years. Once download is complete unzip the file and there are three files under the zip. First, the metadata file which has all the information about the columns. Second, table notes which describes about the table and finally the actual data which is a csv file. Inside the csv file there are 33,146 records and 434 columns. For this proect we are selecting the subset of columns which cosniders the total population across the age groups.
Below are the steps I have followed to transform and load the dataset to the sql database:
Import the necessary libraries required:
Pandas - It presents a diverse range of utilities, ranging from parsing multiple file formats to converting an entire data table into a NumPy matrix array
Numpy - NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
Sys - The sys module provides various functions and variables that are used to manipulate different parts of the Python runtime environment.
import pandas as pd
import numpy as np
import sys
Load the csv file as dataframe using read_csv
df = pd.read_csv('contrib_data/Age_Data_Zip.csv') #read the file
df.head(10) #print first 10 records
Geography | Geographic Area Name | Estimate!!Total!!Total population | Annotation of Estimate!!Total!!Total population | Margin of Error!!Total MOE!!Total population | Annotation of Margin of Error!!Total MOE!!Total population | Estimate!!Total!!Total population!!AGE!!Under 5 years | Annotation of Estimate!!Total!!Total population!!AGE!!Under 5 years | Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years | Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years | ... | Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years | Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years | Estimate!!Total!!Total population!!AGE!!80 to 84 years | Annotation of Estimate!!Total!!Total population!!AGE!!80 to 84 years | Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years | Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years | Estimate!!Total!!Total population!!AGE!!85 years and over | Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over | Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over | Annotation of Estimate!!Total!!Total population!!AGE!!85 years and over | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8600000US00601 | ZCTA5 00601 | 17599 | NaN | 261 | NaN | 938 | NaN | 40 | NaN | ... | 96 | NaN | 349 | NaN | 83 | NaN | 328 | 95 | NaN | NaN |
1 | 8600000US00602 | ZCTA5 00602 | 39209 | NaN | 128 | NaN | 1794 | NaN | 75 | NaN | ... | 159 | NaN | 789 | NaN | 142 | NaN | 648 | 164 | NaN | NaN |
2 | 8600000US00603 | ZCTA5 00603 | 50135 | NaN | 805 | NaN | 2469 | NaN | 92 | NaN | ... | 245 | NaN | 1315 | NaN | 229 | NaN | 727 | 158 | NaN | NaN |
3 | 8600000US00606 | ZCTA5 00606 | 6304 | NaN | 255 | NaN | 312 | NaN | 45 | NaN | ... | 64 | NaN | 74 | NaN | 48 | NaN | 174 | 87 | NaN | NaN |
4 | 8600000US00610 | ZCTA5 00610 | 27590 | NaN | 163 | NaN | 1256 | NaN | 4 | NaN | ... | 146 | NaN | 576 | NaN | 112 | NaN | 365 | 115 | NaN | NaN |
5 | 8600000US00612 | ZCTA5 00612 | 62566 | NaN | 1480 | NaN | 3260 | NaN | 241 | NaN | ... | 319 | NaN | 1433 | NaN | 238 | NaN | 1578 | 271 | NaN | NaN |
6 | 8600000US00616 | ZCTA5 00616 | 10687 | NaN | 1192 | NaN | 469 | NaN | 216 | NaN | ... | 106 | NaN | 326 | NaN | 160 | NaN | 320 | 138 | NaN | NaN |
7 | 8600000US00617 | ZCTA5 00617 | 24508 | NaN | 296 | NaN | 1340 | NaN | 25 | NaN | ... | 188 | NaN | 450 | NaN | 159 | NaN | 503 | 177 | NaN | NaN |
8 | 8600000US00622 | ZCTA5 00622 | 7405 | NaN | 1118 | NaN | 224 | NaN | 140 | NaN | ... | 150 | NaN | 334 | NaN | 151 | NaN | 241 | 129 | NaN | NaN |
9 | 8600000US00623 | ZCTA5 00623 | 42321 | NaN | 1118 | NaN | 1945 | NaN | 140 | NaN | ... | 266 | NaN | 1184 | NaN | 213 | NaN | 737 | 189 | NaN | NaN |
10 rows × 78 columns
Select the necessary columns from the dataset using iloc -> Integer-location based indexing for selection by position.
df1 = df.iloc[:,[1,2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74]]
df1.head(10) #print first 10 records
Geographic Area Name | Estimate!!Total!!Total population | Estimate!!Total!!Total population!!AGE!!Under 5 years | Estimate!!Total!!Total population!!AGE!!5 to 9 years | Estimate!!Total!!Total population!!AGE!!10 to 14 years | Estimate!!Total!!Total population!!AGE!!15 to 19 years | Estimate!!Total!!Total population!!AGE!!20 to 24 years | Estimate!!Total!!Total population!!AGE!!25 to 29 years | Estimate!!Total!!Total population!!AGE!!30 to 34 years | Estimate!!Total!!Total population!!AGE!!35 to 39 years | Estimate!!Total!!Total population!!AGE!!40 to 44 years | Estimate!!Total!!Total population!!AGE!!45 to 49 years | Estimate!!Total!!Total population!!AGE!!50 to 54 years | Estimate!!Total!!Total population!!AGE!!55 to 59 years | Estimate!!Total!!Total population!!AGE!!60 to 64 years | Estimate!!Total!!Total population!!AGE!!65 to 69 years | Estimate!!Total!!Total population!!AGE!!70 to 74 years | Estimate!!Total!!Total population!!AGE!!75 to 79 years | Estimate!!Total!!Total population!!AGE!!80 to 84 years | Estimate!!Total!!Total population!!AGE!!85 years and over | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ZCTA5 00601 | 17599 | 938 | 1046 | 1278 | 1250 | 1218 | 1148 | 1073 | 1121 | 918 | 1147 | 1257 | 1215 | 1093 | 1080 | 616 | 524 | 349 | 328 |
1 | ZCTA5 00602 | 39209 | 1794 | 2320 | 2113 | 2706 | 2751 | 2537 | 2350 | 2715 | 2227 | 2749 | 2942 | 2833 | 2810 | 2181 | 1753 | 991 | 789 | 648 |
2 | ZCTA5 00603 | 50135 | 2469 | 2778 | 3260 | 3443 | 3194 | 3264 | 3021 | 3296 | 3099 | 3070 | 3336 | 2982 | 3323 | 3306 | 2489 | 1763 | 1315 | 727 |
3 | ZCTA5 00606 | 6304 | 312 | 304 | 420 | 479 | 443 | 302 | 329 | 279 | 557 | 457 | 483 | 317 | 555 | 345 | 303 | 171 | 74 | 174 |
4 | ZCTA5 00610 | 27590 | 1256 | 1346 | 1895 | 1939 | 1924 | 1624 | 1575 | 1638 | 1944 | 1937 | 1910 | 1748 | 1744 | 1629 | 1400 | 1140 | 576 | 365 |
5 | ZCTA5 00612 | 62566 | 3260 | 3077 | 4249 | 4397 | 4346 | 3621 | 3586 | 3550 | 4225 | 4312 | 4021 | 3607 | 3903 | 3812 | 3038 | 2551 | 1433 | 1578 |
6 | ZCTA5 00616 | 10687 | 469 | 417 | 405 | 678 | 901 | 784 | 574 | 661 | 687 | 877 | 787 | 558 | 533 | 753 | 712 | 245 | 326 | 320 |
7 | ZCTA5 00617 | 24508 | 1340 | 1401 | 1702 | 1686 | 1709 | 1607 | 1570 | 1719 | 1766 | 1584 | 1434 | 1195 | 1521 | 1390 | 1117 | 814 | 450 | 503 |
8 | ZCTA5 00622 | 7405 | 224 | 296 | 489 | 477 | 428 | 243 | 352 | 382 | 436 | 477 | 579 | 604 | 365 | 519 | 588 | 371 | 334 | 241 |
9 | ZCTA5 00623 | 42321 | 1945 | 2599 | 2457 | 2906 | 2795 | 2528 | 2279 | 2790 | 3075 | 3050 | 2580 | 2298 | 2596 | 2652 | 2073 | 1777 | 1184 | 737 |
Replace the column values which has ‘ZCTA5 00001’ to 00001 using replace function.
df2 = df1.replace('ZCTA5 ','', regex=True)
df2
Geographic Area Name | Estimate!!Total!!Total population | Estimate!!Total!!Total population!!AGE!!Under 5 years | Estimate!!Total!!Total population!!AGE!!5 to 9 years | Estimate!!Total!!Total population!!AGE!!10 to 14 years | Estimate!!Total!!Total population!!AGE!!15 to 19 years | Estimate!!Total!!Total population!!AGE!!20 to 24 years | Estimate!!Total!!Total population!!AGE!!25 to 29 years | Estimate!!Total!!Total population!!AGE!!30 to 34 years | Estimate!!Total!!Total population!!AGE!!35 to 39 years | Estimate!!Total!!Total population!!AGE!!40 to 44 years | Estimate!!Total!!Total population!!AGE!!45 to 49 years | Estimate!!Total!!Total population!!AGE!!50 to 54 years | Estimate!!Total!!Total population!!AGE!!55 to 59 years | Estimate!!Total!!Total population!!AGE!!60 to 64 years | Estimate!!Total!!Total population!!AGE!!65 to 69 years | Estimate!!Total!!Total population!!AGE!!70 to 74 years | Estimate!!Total!!Total population!!AGE!!75 to 79 years | Estimate!!Total!!Total population!!AGE!!80 to 84 years | Estimate!!Total!!Total population!!AGE!!85 years and over | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00601 | 17599 | 938 | 1046 | 1278 | 1250 | 1218 | 1148 | 1073 | 1121 | 918 | 1147 | 1257 | 1215 | 1093 | 1080 | 616 | 524 | 349 | 328 |
1 | 00602 | 39209 | 1794 | 2320 | 2113 | 2706 | 2751 | 2537 | 2350 | 2715 | 2227 | 2749 | 2942 | 2833 | 2810 | 2181 | 1753 | 991 | 789 | 648 |
2 | 00603 | 50135 | 2469 | 2778 | 3260 | 3443 | 3194 | 3264 | 3021 | 3296 | 3099 | 3070 | 3336 | 2982 | 3323 | 3306 | 2489 | 1763 | 1315 | 727 |
3 | 00606 | 6304 | 312 | 304 | 420 | 479 | 443 | 302 | 329 | 279 | 557 | 457 | 483 | 317 | 555 | 345 | 303 | 171 | 74 | 174 |
4 | 00610 | 27590 | 1256 | 1346 | 1895 | 1939 | 1924 | 1624 | 1575 | 1638 | 1944 | 1937 | 1910 | 1748 | 1744 | 1629 | 1400 | 1140 | 576 | 365 |
5 | 00612 | 62566 | 3260 | 3077 | 4249 | 4397 | 4346 | 3621 | 3586 | 3550 | 4225 | 4312 | 4021 | 3607 | 3903 | 3812 | 3038 | 2551 | 1433 | 1578 |
6 | 00616 | 10687 | 469 | 417 | 405 | 678 | 901 | 784 | 574 | 661 | 687 | 877 | 787 | 558 | 533 | 753 | 712 | 245 | 326 | 320 |
7 | 00617 | 24508 | 1340 | 1401 | 1702 | 1686 | 1709 | 1607 | 1570 | 1719 | 1766 | 1584 | 1434 | 1195 | 1521 | 1390 | 1117 | 814 | 450 | 503 |
8 | 00622 | 7405 | 224 | 296 | 489 | 477 | 428 | 243 | 352 | 382 | 436 | 477 | 579 | 604 | 365 | 519 | 588 | 371 | 334 | 241 |
9 | 00623 | 42321 | 1945 | 2599 | 2457 | 2906 | 2795 | 2528 | 2279 | 2790 | 3075 | 3050 | 2580 | 2298 | 2596 | 2652 | 2073 | 1777 | 1184 | 737 |
10 | 00624 | 23238 | 1396 | 1604 | 1562 | 1609 | 1684 | 1651 | 1399 | 1653 | 1108 | 1477 | 1504 | 1602 | 1413 | 1103 | 1079 | 806 | 363 | 225 |
11 | 00627 | 32936 | 1557 | 1770 | 2109 | 2333 | 2377 | 2024 | 1886 | 2366 | 1949 | 2320 | 2269 | 2051 | 2102 | 2022 | 1540 | 1109 | 410 | 742 |
12 | 00631 | 1555 | 34 | 54 | 54 | 75 | 133 | 116 | 103 | 56 | 105 | 78 | 105 | 110 | 138 | 114 | 87 | 68 | 68 | 57 |
13 | 00637 | 23610 | 1210 | 1199 | 1629 | 1665 | 1607 | 1368 | 1309 | 1384 | 1576 | 1461 | 1424 | 1501 | 1432 | 1329 | 1432 | 1056 | 527 | 501 |
14 | 00638 | 17202 | 1025 | 1170 | 993 | 1076 | 1232 | 1077 | 1041 | 930 | 1042 | 993 | 1178 | 1086 | 1305 | 991 | 772 | 552 | 497 | 242 |
15 | 00641 | 28332 | 1359 | 1546 | 1760 | 1975 | 2037 | 1832 | 1532 | 1579 | 1645 | 1766 | 1985 | 1937 | 2080 | 1717 | 1358 | 1092 | 494 | 638 |
16 | 00646 | 38008 | 2033 | 2570 | 2658 | 2730 | 2448 | 2078 | 2240 | 2610 | 3112 | 2767 | 2477 | 2020 | 2349 | 2258 | 1405 | 913 | 843 | 497 |
17 | 00647 | 5610 | 226 | 204 | 264 | 357 | 411 | 241 | 406 | 232 | 280 | 403 | 398 | 332 | 423 | 502 | 357 | 285 | 117 | 172 |
18 | 00650 | 14664 | 848 | 886 | 939 | 1175 | 1147 | 819 | 847 | 903 | 963 | 1058 | 915 | 715 | 876 | 851 | 692 | 460 | 330 | 240 |
Rename the column names to the standard names using rename function.
df2.rename(columns={'Geographic Area Name':'ZCTA5','Estimate!!Total!!Total population':'total_population','Estimate!!Total!!Total population!!AGE!!Under 5 years':'age_5under','Estimate!!Total!!Total population!!AGE!!5 to 9 years':'age_5to9y','Estimate!!Total!!Total population!!AGE!!10 to 14 years':'age_10to14y','Estimate!!Total!!Total population!!AGE!!15 to 19 years':'age_15to19y'}, inplace=True)
df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!20 to 24 years':'age_20to24y','Estimate!!Total!!Total population!!AGE!!25 to 29 years':'age_25to29y','Estimate!!Total!!Total population!!AGE!!30 to 34 years':'age_30to34y','Estimate!!Total!!Total population!!AGE!!35 to 39 years':'age_35to39y','Estimate!!Total!!Total population!!AGE!!40 to 44 years':'age_40to44y','Estimate!!Total!!Total population!!AGE!!45 to 49 years':'age_45to49y'}, inplace=True)
df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!50 to 54 years':'age_50to54y','Estimate!!Total!!Total population!!AGE!!55 to 59 years':'age_55to59y','Estimate!!Total!!Total population!!AGE!!60 to 64 years':'age_60to64y','Estimate!!Total!!Total population!!AGE!!65 to 69 years':'age_65to69y','Estimate!!Total!!Total population!!AGE!!70 to 74 years':'age_70to74y','Estimate!!Total!!Total population!!AGE!!75 to 79 years':'age_75to79y','Estimate!!Total!!Total population!!AGE!!80 to 84 years':'age_80to84y','Estimate!!Total!!Total population!!AGE!!85 years and over':'age_above85y'}, inplace=True)
Drop the duplicates from the dataframe using drop_duplicates function
df3 = df2.drop_duplicates()
df3
ZCTA5 | total_population | age_5under | age_5to9y | age_10to14y | age_15to19y | age_20to24y | age_25to29y | age_30to34y | age_35to39y | age_40to44y | age_45to49y | age_50to54y | age_55to59y | age_60to64y | age_65to69y | age_70to74y | age_75to79y | age_80to84y | age_above85y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00601 | 17599 | 938 | 1046 | 1278 | 1250 | 1218 | 1148 | 1073 | 1121 | 918 | 1147 | 1257 | 1215 | 1093 | 1080 | 616 | 524 | 349 | 328 |
1 | 00602 | 39209 | 1794 | 2320 | 2113 | 2706 | 2751 | 2537 | 2350 | 2715 | 2227 | 2749 | 2942 | 2833 | 2810 | 2181 | 1753 | 991 | 789 | 648 |
2 | 00603 | 50135 | 2469 | 2778 | 3260 | 3443 | 3194 | 3264 | 3021 | 3296 | 3099 | 3070 | 3336 | 2982 | 3323 | 3306 | 2489 | 1763 | 1315 | 727 |
3 | 00606 | 6304 | 312 | 304 | 420 | 479 | 443 | 302 | 329 | 279 | 557 | 457 | 483 | 317 | 555 | 345 | 303 | 171 | 74 | 174 |
4 | 00610 | 27590 | 1256 | 1346 | 1895 | 1939 | 1924 | 1624 | 1575 | 1638 | 1944 | 1937 | 1910 | 1748 | 1744 | 1629 | 1400 | 1140 | 576 | 365 |
5 | 00612 | 62566 | 3260 | 3077 | 4249 | 4397 | 4346 | 3621 | 3586 | 3550 | 4225 | 4312 | 4021 | 3607 | 3903 | 3812 | 3038 | 2551 | 1433 | 1578 |
6 | 00616 | 10687 | 469 | 417 | 405 | 678 | 901 | 784 | 574 | 661 | 687 | 877 | 787 | 558 | 533 | 753 | 712 | 245 | 326 | 320 |
7 | 00617 | 24508 | 1340 | 1401 | 1702 | 1686 | 1709 | 1607 | 1570 | 1719 | 1766 | 1584 | 1434 | 1195 | 1521 | 1390 | 1117 | 814 | 450 | 503 |
8 | 00622 | 7405 | 224 | 296 | 489 | 477 | 428 | 243 | 352 | 382 | 436 | 477 | 579 | 604 | 365 | 519 | 588 | 371 | 334 | 241 |
9 | 00623 | 42321 | 1945 | 2599 | 2457 | 2906 | 2795 | 2528 | 2279 | 2790 | 3075 | 3050 | 2580 | 2298 | 2596 | 2652 | 2073 | 1777 | 1184 | 737 |
10 | 00624 | 23238 | 1396 | 1604 | 1562 | 1609 | 1684 | 1651 | 1399 | 1653 | 1108 | 1477 | 1504 | 1602 | 1413 | 1103 | 1079 | 806 | 363 | 225 |
11 | 00627 | 32936 | 1557 | 1770 | 2109 | 2333 | 2377 | 2024 | 1886 | 2366 | 1949 | 2320 | 2269 | 2051 | 2102 | 2022 | 1540 | 1109 | 410 | 742 |
12 | 00631 | 1555 | 34 | 54 | 54 | 75 | 133 | 116 | 103 | 56 | 105 | 78 | 105 | 110 | 138 | 114 | 87 | 68 | 68 | 57 |
13 | 00637 | 23610 | 1210 | 1199 | 1629 | 1665 | 1607 | 1368 | 1309 | 1384 | 1576 | 1461 | 1424 | 1501 | 1432 | 1329 | 1432 | 1056 | 527 | 501 |
14 | 00638 | 17202 | 1025 | 1170 | 993 | 1076 | 1232 | 1077 | 1041 | 930 | 1042 | 993 | 1178 | 1086 | 1305 | 991 | 772 | 552 | 497 | 242 |
15 | 00641 | 28332 | 1359 | 1546 | 1760 | 1975 | 2037 | 1832 | 1532 | 1579 | 1645 | 1766 | 1985 | 1937 | 2080 | 1717 | 1358 | 1092 | 494 | 638 |
16 | 00646 | 38008 | 2033 | 2570 | 2658 | 2730 | 2448 | 2078 | 2240 | 2610 | 3112 | 2767 | 2477 | 2020 | 2349 | 2258 | 1405 | 913 | 843 | 497 |
17 | 00647 | 5610 | 226 | 204 | 264 | 357 | 411 | 241 | 406 | 232 | 280 | 403 | 398 | 332 | 423 | 502 | 357 | 285 | 117 | 172 |
18 | 00650 | 14664 | 848 | 886 | 939 | 1175 | 1147 | 819 | 847 | 903 | 963 | 1058 | 915 | 715 | 876 | 851 | 692 | 460 | 330 | 240 |
Add new column to the dataframe using insert function for the year which we are loading data.
df3.insert(1, "Year", "2020", True)
df3
ZCTA5 | Year | total_population | age_5under | age_5to9y | age_10to14y | age_15to19y | age_20to24y | age_25to29y | age_30to34y | ... | age_40to44y | age_45to49y | age_50to54y | age_55to59y | age_60to64y | age_65to69y | age_70to74y | age_75to79y | age_80to84y | age_above85y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00601 | 2020 | 17599 | 938 | 1046 | 1278 | 1250 | 1218 | 1148 | 1073 | ... | 918 | 1147 | 1257 | 1215 | 1093 | 1080 | 616 | 524 | 349 | 328 |
1 | 00602 | 2020 | 39209 | 1794 | 2320 | 2113 | 2706 | 2751 | 2537 | 2350 | ... | 2227 | 2749 | 2942 | 2833 | 2810 | 2181 | 1753 | 991 | 789 | 648 |
2 | 00603 | 2020 | 50135 | 2469 | 2778 | 3260 | 3443 | 3194 | 3264 | 3021 | ... | 3099 | 3070 | 3336 | 2982 | 3323 | 3306 | 2489 | 1763 | 1315 | 727 |
3 | 00606 | 2020 | 6304 | 312 | 304 | 420 | 479 | 443 | 302 | 329 | ... | 557 | 457 | 483 | 317 | 555 | 345 | 303 | 171 | 74 | 174 |
4 | 00610 | 2020 | 27590 | 1256 | 1346 | 1895 | 1939 | 1924 | 1624 | 1575 | ... | 1944 | 1937 | 1910 | 1748 | 1744 | 1629 | 1400 | 1140 | 576 | 365 |
5 | 00612 | 2020 | 62566 | 3260 | 3077 | 4249 | 4397 | 4346 | 3621 | 3586 | ... | 4225 | 4312 | 4021 | 3607 | 3903 | 3812 | 3038 | 2551 | 1433 | 1578 |
6 | 00616 | 2020 | 10687 | 469 | 417 | 405 | 678 | 901 | 784 | 574 | ... | 687 | 877 | 787 | 558 | 533 | 753 | 712 | 245 | 326 | 320 |
7 | 00617 | 2020 | 24508 | 1340 | 1401 | 1702 | 1686 | 1709 | 1607 | 1570 | ... | 1766 | 1584 | 1434 | 1195 | 1521 | 1390 | 1117 | 814 | 450 | 503 |
8 | 00622 | 2020 | 7405 | 224 | 296 | 489 | 477 | 428 | 243 | 352 | ... | 436 | 477 | 579 | 604 | 365 | 519 | 588 | 371 | 334 | 241 |
9 | 00623 | 2020 | 42321 | 1945 | 2599 | 2457 | 2906 | 2795 | 2528 | 2279 | ... | 3075 | 3050 | 2580 | 2298 | 2596 | 2652 | 2073 | 1777 | 1184 | 737 |
10 | 00624 | 2020 | 23238 | 1396 | 1604 | 1562 | 1609 | 1684 | 1651 | 1399 | ... | 1108 | 1477 | 1504 | 1602 | 1413 | 1103 | 1079 | 806 | 363 | 225 |
11 | 00627 | 2020 | 32936 | 1557 | 1770 | 2109 | 2333 | 2377 | 2024 | 1886 | ... | 1949 | 2320 | 2269 | 2051 | 2102 | 2022 | 1540 | 1109 | 410 | 742 |
12 | 00631 | 2020 | 1555 | 34 | 54 | 54 | 75 | 133 | 116 | 103 | ... | 105 | 78 | 105 | 110 | 138 | 114 | 87 | 68 | 68 | 57 |
13 | 00637 | 2020 | 23610 | 1210 | 1199 | 1629 | 1665 | 1607 | 1368 | 1309 | ... | 1576 | 1461 | 1424 | 1501 | 1432 | 1329 | 1432 | 1056 | 527 | 501 |
14 | 00638 | 2020 | 17202 | 1025 | 1170 | 993 | 1076 | 1232 | 1077 | 1041 | ... | 1042 | 993 | 1178 | 1086 | 1305 | 991 | 772 | 552 | 497 | 242 |
15 | 00641 | 2020 | 28332 | 1359 | 1546 | 1760 | 1975 | 2037 | 1832 | 1532 | ... | 1645 | 1766 | 1985 | 1937 | 2080 | 1717 | 1358 | 1092 | 494 | 638 |
16 | 00646 | 2020 | 38008 | 2033 | 2570 | 2658 | 2730 | 2448 | 2078 | 2240 | ... | 3112 | 2767 | 2477 | 2020 | 2349 | 2258 | 1405 | 913 | 843 | 497 |
17 | 00647 | 2020 | 5610 | 226 | 204 | 264 | 357 | 411 | 241 | 406 | ... | 280 | 403 | 398 | 332 | 423 | 502 | 357 | 285 | 117 | 172 |
18 | 00650 | 2020 | 14664 | 848 | 886 | 939 | 1175 | 1147 | 819 | 847 | ... | 963 | 1058 | 915 | 715 | 876 | 851 | 692 | 460 | 330 | 240 |
19 rows × 21 columns
If there are files for each year then oncatenate the data for all these years using concat function. Here I just used df3 as I showed only for one year, similarly we can create multiple dataframes for each year and concatenate using the below function
dataframes = [df3]
result = pd.concat(dataframes)
Load the result to the sql server database#
Now, connect to the sql and load the dataframe as new table using SQLAlchemy. It is a sql toolkit to connect and perform all the necessary sql operations from python.
svr is the sql server name
uid is the user id
pwd is the password
db is the database which is USZIPCODE
create_engine will establish python jdbc to the sql
from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'XXXXXX'
pwd = 'XXXXXX'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
result.to_sql('AGE',engine, if_exists='replace')
The above peice of code will create a new table called ‘AGE’ in the ‘USZIPCODE’ database. The table will have all the columns which are present in the results dataframe.
We can validate the data by logging into the sql server directly or we can read the table using read_sql as shown below:
df = pd.read_sql('select * from AGE', engine)
print('***Age data for the ZCTA5***')
print(df)
Use cases#
There are many use cases which can be derived from the age group dataset. Below are few use cases. For all the use cases below, please update the following fields in the below python code before executing
uid - user id
pwd - password
SQL Query - the sql query which needs to be executed
from sqlalchemy import create_engine
import pandas as pd
svr = 'essql1.walton.uark.edu'
uid = 'xxx'
pwd = 'xxx'
db = 'USZIPCODE'
# Create a connection using SQL Alchemy engine
engine = create_engine(f"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes", fast_executemany=True)
# Using pandas, leverage the SQL Alchemy engine that we created above
df = pd.read_sql('SQL Query', engine)
print(df)
Use case 1:#
Find the total population of all age groups across all years and ZCTA5
select year, sum(total_population) as total_population
from [USZIPCODE].[dbo].[AGE]
group by year;
Use case 2:#
Top five ZCTA5 codes where the population is highest
select top(5) zcta5,total_population,year
from [USZIPCODE].[dbo].[AGE]
where year=2020
order by total_population desc;
Use case 3:#
ZCTA5 with the highest teenage population for a given year
select top(1) ZCTA5, total
from ( select ZCTA5,sum(age_10to14y + age_15to19y) as total
from [USZIPCODE].[dbo].[AGE]
where year=2020
group by ZCTA5)
order by total desc;
Use case 4:#
Population growth across years
select year,total_population, total_population-LAG(total_population) OVER (ORDER BY year) AS population_diff
from ( select year, sum(total_population) as total_population
from [USZIPCODE].[dbo].[AGE]
group by year);