{ "cells": [ { "cell_type": "markdown", "id": "5bd00677-eab0-45c0-999b-a89aa239098e", "metadata": {}, "source": [ "# Manipulating census data with pandas\n", "## ZCTA5 - Age Data\n", "\n", "For my project, I choose the Zip Code dataset. I found the dataset from the Census Bureau [age dataset](https://data.census.gov/) \n", "\n", "For the raw data extraction select the above census bureau link and apply the filter 'All 5-digit zip code'. Next, select the code 'S0101AGE AND SEX' from the vast number of available datasets. The data is available from 2010-2020. In this project we have considered the last 4 years 2017-2020. Download the data by selecting the years. Once download is complete unzip the file and there are three files under the zip. First, the metadata file which has all the information about the columns. Second, table notes which describes about the table and finally the actual data which is a csv file. Inside the csv file there are 33,146 records and 434 columns. For this proect we are selecting the subset of columns which cosniders the total population across the age groups." ] }, { "cell_type": "markdown", "id": "8c22c86b-e20a-4255-9b9d-28e4181f7cd1", "metadata": { "tags": [] }, "source": [ "Below are the steps I have followed to transform and load the dataset to the sql database:\n", "\n", "\n", "Import the necessary libraries required:\n", "1) Pandas - It presents a diverse range of utilities, ranging from parsing multiple file formats to converting an entire data table into a NumPy matrix array\n", "2) Numpy - NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.\n", "3) Sys - The sys module provides various functions and variables that are used to manipulate different parts of the Python runtime environment." ] }, { "cell_type": "code", "execution_count": 2, "id": "45db09e5-87c4-4035-b186-2dbdac5706db", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "import sys" ] }, { "cell_type": "markdown", "id": "f8b6084e-fa9e-41b2-92d7-baaee56298ca", "metadata": {}, "source": [ "\n", "Load the csv file as dataframe using read_csv" ] }, { "cell_type": "code", "execution_count": 3, "id": "bed52702-df0e-414e-b84b-ce1e363800fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GeographyGeographic Area NameEstimate!!Total!!Total populationAnnotation of Estimate!!Total!!Total populationMargin of Error!!Total MOE!!Total populationAnnotation of Margin of Error!!Total MOE!!Total populationEstimate!!Total!!Total population!!AGE!!Under 5 yearsAnnotation of Estimate!!Total!!Total population!!AGE!!Under 5 yearsMargin of Error!!Total MOE!!Total population!!AGE!!Under 5 yearsAnnotation of Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years...Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 yearsAnnotation of Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 yearsEstimate!!Total!!Total population!!AGE!!80 to 84 yearsAnnotation of Estimate!!Total!!Total population!!AGE!!80 to 84 yearsMargin of Error!!Total MOE!!Total population!!AGE!!80 to 84 yearsAnnotation of Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 yearsEstimate!!Total!!Total population!!AGE!!85 years and overMargin of Error!!Total MOE!!Total population!!AGE!!85 years and overAnnotation of Margin of Error!!Total MOE!!Total population!!AGE!!85 years and overAnnotation of Estimate!!Total!!Total population!!AGE!!85 years and over
08600000US00601ZCTA5 0060117599NaN261NaN938NaN40NaN...96NaN349NaN83NaN32895NaNNaN
18600000US00602ZCTA5 0060239209NaN128NaN1794NaN75NaN...159NaN789NaN142NaN648164NaNNaN
28600000US00603ZCTA5 0060350135NaN805NaN2469NaN92NaN...245NaN1315NaN229NaN727158NaNNaN
38600000US00606ZCTA5 006066304NaN255NaN312NaN45NaN...64NaN74NaN48NaN17487NaNNaN
48600000US00610ZCTA5 0061027590NaN163NaN1256NaN4NaN...146NaN576NaN112NaN365115NaNNaN
58600000US00612ZCTA5 0061262566NaN1480NaN3260NaN241NaN...319NaN1433NaN238NaN1578271NaNNaN
68600000US00616ZCTA5 0061610687NaN1192NaN469NaN216NaN...106NaN326NaN160NaN320138NaNNaN
78600000US00617ZCTA5 0061724508NaN296NaN1340NaN25NaN...188NaN450NaN159NaN503177NaNNaN
88600000US00622ZCTA5 006227405NaN1118NaN224NaN140NaN...150NaN334NaN151NaN241129NaNNaN
98600000US00623ZCTA5 0062342321NaN1118NaN1945NaN140NaN...266NaN1184NaN213NaN737189NaNNaN
\n", "

10 rows × 78 columns

\n", "
" ], "text/plain": [ " Geography Geographic Area Name Estimate!!Total!!Total population \\\n", "0 8600000US00601 ZCTA5 00601 17599 \n", "1 8600000US00602 ZCTA5 00602 39209 \n", "2 8600000US00603 ZCTA5 00603 50135 \n", "3 8600000US00606 ZCTA5 00606 6304 \n", "4 8600000US00610 ZCTA5 00610 27590 \n", "5 8600000US00612 ZCTA5 00612 62566 \n", "6 8600000US00616 ZCTA5 00616 10687 \n", "7 8600000US00617 ZCTA5 00617 24508 \n", "8 8600000US00622 ZCTA5 00622 7405 \n", "9 8600000US00623 ZCTA5 00623 42321 \n", "\n", " Annotation of Estimate!!Total!!Total population \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Margin of Error!!Total MOE!!Total population \\\n", "0 261 \n", "1 128 \n", "2 805 \n", "3 255 \n", "4 163 \n", "5 1480 \n", "6 1192 \n", "7 296 \n", "8 1118 \n", "9 1118 \n", "\n", " Annotation of Margin of Error!!Total MOE!!Total population \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Estimate!!Total!!Total population!!AGE!!Under 5 years \\\n", "0 938 \n", "1 1794 \n", "2 2469 \n", "3 312 \n", "4 1256 \n", "5 3260 \n", "6 469 \n", "7 1340 \n", "8 224 \n", "9 1945 \n", "\n", " Annotation of Estimate!!Total!!Total population!!AGE!!Under 5 years \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years \\\n", "0 40 \n", "1 75 \n", "2 92 \n", "3 45 \n", "4 4 \n", "5 241 \n", "6 216 \n", "7 25 \n", "8 140 \n", "9 140 \n", "\n", " Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!Under 5 years \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " ... Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years \\\n", "0 ... 96 \n", "1 ... 159 \n", "2 ... 245 \n", "3 ... 64 \n", "4 ... 146 \n", "5 ... 319 \n", "6 ... 106 \n", "7 ... 188 \n", "8 ... 150 \n", "9 ... 266 \n", "\n", " Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!75 to 79 years \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Estimate!!Total!!Total population!!AGE!!80 to 84 years \\\n", "0 349 \n", "1 789 \n", "2 1315 \n", "3 74 \n", "4 576 \n", "5 1433 \n", "6 326 \n", "7 450 \n", "8 334 \n", "9 1184 \n", "\n", " Annotation of Estimate!!Total!!Total population!!AGE!!80 to 84 years \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years \\\n", "0 83 \n", "1 142 \n", "2 229 \n", "3 48 \n", "4 112 \n", "5 238 \n", "6 160 \n", "7 159 \n", "8 151 \n", "9 213 \n", "\n", " Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!80 to 84 years \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Estimate!!Total!!Total population!!AGE!!85 years and over \\\n", "0 328 \n", "1 648 \n", "2 727 \n", "3 174 \n", "4 365 \n", "5 1578 \n", "6 320 \n", "7 503 \n", "8 241 \n", "9 737 \n", "\n", " Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over \\\n", "0 95 \n", "1 164 \n", "2 158 \n", "3 87 \n", "4 115 \n", "5 271 \n", "6 138 \n", "7 177 \n", "8 129 \n", "9 189 \n", "\n", " Annotation of Margin of Error!!Total MOE!!Total population!!AGE!!85 years and over \\\n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", " Annotation of Estimate!!Total!!Total population!!AGE!!85 years and over \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", "[10 rows x 78 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('contrib_data/Age_Data_Zip.csv') #read the file\n", "df.head(10) #print first 10 records" ] }, { "cell_type": "markdown", "id": "e790cfad-2443-46a9-9287-94458585a2e6", "metadata": {}, "source": [ "\n", "Select the necessary columns from the dataset using iloc -> Integer-location based indexing for selection by position." ] }, { "cell_type": "code", "execution_count": 3, "id": "b9e485fb-697f-4423-a292-dd83947f94bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Geographic Area NameEstimate!!Total!!Total populationEstimate!!Total!!Total population!!AGE!!Under 5 yearsEstimate!!Total!!Total population!!AGE!!5 to 9 yearsEstimate!!Total!!Total population!!AGE!!10 to 14 yearsEstimate!!Total!!Total population!!AGE!!15 to 19 yearsEstimate!!Total!!Total population!!AGE!!20 to 24 yearsEstimate!!Total!!Total population!!AGE!!25 to 29 yearsEstimate!!Total!!Total population!!AGE!!30 to 34 yearsEstimate!!Total!!Total population!!AGE!!35 to 39 yearsEstimate!!Total!!Total population!!AGE!!40 to 44 yearsEstimate!!Total!!Total population!!AGE!!45 to 49 yearsEstimate!!Total!!Total population!!AGE!!50 to 54 yearsEstimate!!Total!!Total population!!AGE!!55 to 59 yearsEstimate!!Total!!Total population!!AGE!!60 to 64 yearsEstimate!!Total!!Total population!!AGE!!65 to 69 yearsEstimate!!Total!!Total population!!AGE!!70 to 74 yearsEstimate!!Total!!Total population!!AGE!!75 to 79 yearsEstimate!!Total!!Total population!!AGE!!80 to 84 yearsEstimate!!Total!!Total population!!AGE!!85 years and over
0ZCTA5 0060117599938104612781250121811481073112191811471257121510931080616524349328
1ZCTA5 0060239209179423202113270627512537235027152227274929422833281021811753991789648
2ZCTA5 006035013524692778326034433194326430213296309930703336298233233306248917631315727
3ZCTA5 00606630431230442047944330232927955745748331755534530317174174
4ZCTA5 00610275901256134618951939192416241575163819441937191017481744162914001140576365
5ZCTA5 0061262566326030774249439743463621358635504225431240213607390338123038255114331578
6ZCTA5 0061610687469417405678901784574661687877787558533753712245326320
7ZCTA5 0061724508134014011702168617091607157017191766158414341195152113901117814450503
8ZCTA5 006227405224296489477428243352382436477579604365519588371334241
9ZCTA5 006234232119452599245729062795252822792790307530502580229825962652207317771184737
\n", "
" ], "text/plain": [ " Geographic Area Name Estimate!!Total!!Total population \\\n", "0 ZCTA5 00601 17599 \n", "1 ZCTA5 00602 39209 \n", "2 ZCTA5 00603 50135 \n", "3 ZCTA5 00606 6304 \n", "4 ZCTA5 00610 27590 \n", "5 ZCTA5 00612 62566 \n", "6 ZCTA5 00616 10687 \n", "7 ZCTA5 00617 24508 \n", "8 ZCTA5 00622 7405 \n", "9 ZCTA5 00623 42321 \n", "\n", " Estimate!!Total!!Total population!!AGE!!Under 5 years \\\n", "0 938 \n", "1 1794 \n", "2 2469 \n", "3 312 \n", "4 1256 \n", "5 3260 \n", "6 469 \n", "7 1340 \n", "8 224 \n", "9 1945 \n", "\n", " Estimate!!Total!!Total population!!AGE!!5 to 9 years \\\n", "0 1046 \n", "1 2320 \n", "2 2778 \n", "3 304 \n", "4 1346 \n", "5 3077 \n", "6 417 \n", "7 1401 \n", "8 296 \n", "9 2599 \n", "\n", " Estimate!!Total!!Total population!!AGE!!10 to 14 years \\\n", "0 1278 \n", "1 2113 \n", "2 3260 \n", "3 420 \n", "4 1895 \n", "5 4249 \n", "6 405 \n", "7 1702 \n", "8 489 \n", "9 2457 \n", "\n", " Estimate!!Total!!Total population!!AGE!!15 to 19 years \\\n", "0 1250 \n", "1 2706 \n", "2 3443 \n", "3 479 \n", "4 1939 \n", "5 4397 \n", "6 678 \n", "7 1686 \n", "8 477 \n", "9 2906 \n", "\n", " Estimate!!Total!!Total population!!AGE!!20 to 24 years \\\n", "0 1218 \n", "1 2751 \n", "2 3194 \n", "3 443 \n", "4 1924 \n", "5 4346 \n", "6 901 \n", "7 1709 \n", "8 428 \n", "9 2795 \n", "\n", " Estimate!!Total!!Total population!!AGE!!25 to 29 years \\\n", "0 1148 \n", "1 2537 \n", "2 3264 \n", "3 302 \n", "4 1624 \n", "5 3621 \n", "6 784 \n", "7 1607 \n", "8 243 \n", "9 2528 \n", "\n", " Estimate!!Total!!Total population!!AGE!!30 to 34 years \\\n", "0 1073 \n", "1 2350 \n", "2 3021 \n", "3 329 \n", "4 1575 \n", "5 3586 \n", "6 574 \n", "7 1570 \n", "8 352 \n", "9 2279 \n", "\n", " Estimate!!Total!!Total population!!AGE!!35 to 39 years \\\n", "0 1121 \n", "1 2715 \n", "2 3296 \n", "3 279 \n", "4 1638 \n", "5 3550 \n", "6 661 \n", "7 1719 \n", "8 382 \n", "9 2790 \n", "\n", " Estimate!!Total!!Total population!!AGE!!40 to 44 years \\\n", "0 918 \n", "1 2227 \n", "2 3099 \n", "3 557 \n", "4 1944 \n", "5 4225 \n", "6 687 \n", "7 1766 \n", "8 436 \n", "9 3075 \n", "\n", " Estimate!!Total!!Total population!!AGE!!45 to 49 years \\\n", "0 1147 \n", "1 2749 \n", "2 3070 \n", "3 457 \n", "4 1937 \n", "5 4312 \n", "6 877 \n", "7 1584 \n", "8 477 \n", "9 3050 \n", "\n", " Estimate!!Total!!Total population!!AGE!!50 to 54 years \\\n", "0 1257 \n", "1 2942 \n", "2 3336 \n", "3 483 \n", "4 1910 \n", "5 4021 \n", "6 787 \n", "7 1434 \n", "8 579 \n", "9 2580 \n", "\n", " Estimate!!Total!!Total population!!AGE!!55 to 59 years \\\n", "0 1215 \n", "1 2833 \n", "2 2982 \n", "3 317 \n", "4 1748 \n", "5 3607 \n", "6 558 \n", "7 1195 \n", "8 604 \n", "9 2298 \n", "\n", " Estimate!!Total!!Total population!!AGE!!60 to 64 years \\\n", "0 1093 \n", "1 2810 \n", "2 3323 \n", "3 555 \n", "4 1744 \n", "5 3903 \n", "6 533 \n", "7 1521 \n", "8 365 \n", "9 2596 \n", "\n", " Estimate!!Total!!Total population!!AGE!!65 to 69 years \\\n", "0 1080 \n", "1 2181 \n", "2 3306 \n", "3 345 \n", "4 1629 \n", "5 3812 \n", "6 753 \n", "7 1390 \n", "8 519 \n", "9 2652 \n", "\n", " Estimate!!Total!!Total population!!AGE!!70 to 74 years \\\n", "0 616 \n", "1 1753 \n", "2 2489 \n", "3 303 \n", "4 1400 \n", "5 3038 \n", "6 712 \n", "7 1117 \n", "8 588 \n", "9 2073 \n", "\n", " Estimate!!Total!!Total population!!AGE!!75 to 79 years \\\n", "0 524 \n", "1 991 \n", "2 1763 \n", "3 171 \n", "4 1140 \n", "5 2551 \n", "6 245 \n", "7 814 \n", "8 371 \n", "9 1777 \n", "\n", " Estimate!!Total!!Total population!!AGE!!80 to 84 years \\\n", "0 349 \n", "1 789 \n", "2 1315 \n", "3 74 \n", "4 576 \n", "5 1433 \n", "6 326 \n", "7 450 \n", "8 334 \n", "9 1184 \n", "\n", " Estimate!!Total!!Total population!!AGE!!85 years and over \n", "0 328 \n", "1 648 \n", "2 727 \n", "3 174 \n", "4 365 \n", "5 1578 \n", "6 320 \n", "7 503 \n", "8 241 \n", "9 737 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.iloc[:,[1,2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74]] \n", "\n", "df1.head(10) #print first 10 records" ] }, { "cell_type": "markdown", "id": "2dbf366d-6d5e-4478-9ee9-0153bc3433c5", "metadata": {}, "source": [ "Replace the column values which has 'ZCTA5 00001' to 00001 using replace function." ] }, { "cell_type": "code", "execution_count": 4, "id": "ee0cf0b2-3e67-4bfc-a985-b29231230d05", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Geographic Area NameEstimate!!Total!!Total populationEstimate!!Total!!Total population!!AGE!!Under 5 yearsEstimate!!Total!!Total population!!AGE!!5 to 9 yearsEstimate!!Total!!Total population!!AGE!!10 to 14 yearsEstimate!!Total!!Total population!!AGE!!15 to 19 yearsEstimate!!Total!!Total population!!AGE!!20 to 24 yearsEstimate!!Total!!Total population!!AGE!!25 to 29 yearsEstimate!!Total!!Total population!!AGE!!30 to 34 yearsEstimate!!Total!!Total population!!AGE!!35 to 39 yearsEstimate!!Total!!Total population!!AGE!!40 to 44 yearsEstimate!!Total!!Total population!!AGE!!45 to 49 yearsEstimate!!Total!!Total population!!AGE!!50 to 54 yearsEstimate!!Total!!Total population!!AGE!!55 to 59 yearsEstimate!!Total!!Total population!!AGE!!60 to 64 yearsEstimate!!Total!!Total population!!AGE!!65 to 69 yearsEstimate!!Total!!Total population!!AGE!!70 to 74 yearsEstimate!!Total!!Total population!!AGE!!75 to 79 yearsEstimate!!Total!!Total population!!AGE!!80 to 84 yearsEstimate!!Total!!Total population!!AGE!!85 years and over
00060117599938104612781250121811481073112191811471257121510931080616524349328
10060239209179423202113270627512537235027152227274929422833281021811753991789648
2006035013524692778326034433194326430213296309930703336298233233306248917631315727
300606630431230442047944330232927955745748331755534530317174174
400610275901256134618951939192416241575163819441937191017481744162914001140576365
50061262566326030774249439743463621358635504225431240213607390338123038255114331578
60061610687469417405678901784574661687877787558533753712245326320
70061724508134014011702168617091607157017191766158414341195152113901117814450503
8006227405224296489477428243352382436477579604365519588371334241
9006234232119452599245729062795252822792790307530502580229825962652207317771184737
100062423238139616041562160916841651139916531108147715041602141311031079806363225
1100627329361557177021092333237720241886236619492320226920512102202215401109410742
1200631155534545475133116103561057810511013811487686857
1300637236101210119916291665160713681309138415761461142415011432132914321056527501
1400638172021025117099310761232107710419301042993117810861305991772552497242
1500641283321359154617601975203718321532157916451766198519372080171713581092494638
160064638008203325702658273024482078224026103112276724772020234922581405913843497
17006475610226204264357411241406232280403398332423502357285117172
180065014664848886939117511478198479039631058915715876851692460330240
\n", "
" ], "text/plain": [ " Geographic Area Name Estimate!!Total!!Total population \\\n", "0 00601 17599 \n", "1 00602 39209 \n", "2 00603 50135 \n", "3 00606 6304 \n", "4 00610 27590 \n", "5 00612 62566 \n", "6 00616 10687 \n", "7 00617 24508 \n", "8 00622 7405 \n", "9 00623 42321 \n", "10 00624 23238 \n", "11 00627 32936 \n", "12 00631 1555 \n", "13 00637 23610 \n", "14 00638 17202 \n", "15 00641 28332 \n", "16 00646 38008 \n", "17 00647 5610 \n", "18 00650 14664 \n", "\n", " Estimate!!Total!!Total population!!AGE!!Under 5 years \\\n", "0 938 \n", "1 1794 \n", "2 2469 \n", "3 312 \n", "4 1256 \n", "5 3260 \n", "6 469 \n", "7 1340 \n", "8 224 \n", "9 1945 \n", "10 1396 \n", "11 1557 \n", "12 34 \n", "13 1210 \n", "14 1025 \n", "15 1359 \n", "16 2033 \n", "17 226 \n", "18 848 \n", "\n", " Estimate!!Total!!Total population!!AGE!!5 to 9 years \\\n", "0 1046 \n", "1 2320 \n", "2 2778 \n", "3 304 \n", "4 1346 \n", "5 3077 \n", "6 417 \n", "7 1401 \n", "8 296 \n", "9 2599 \n", "10 1604 \n", "11 1770 \n", "12 54 \n", "13 1199 \n", "14 1170 \n", "15 1546 \n", "16 2570 \n", "17 204 \n", "18 886 \n", "\n", " Estimate!!Total!!Total population!!AGE!!10 to 14 years \\\n", "0 1278 \n", "1 2113 \n", "2 3260 \n", "3 420 \n", "4 1895 \n", "5 4249 \n", "6 405 \n", "7 1702 \n", "8 489 \n", "9 2457 \n", "10 1562 \n", "11 2109 \n", "12 54 \n", "13 1629 \n", "14 993 \n", "15 1760 \n", "16 2658 \n", "17 264 \n", "18 939 \n", "\n", " Estimate!!Total!!Total population!!AGE!!15 to 19 years \\\n", "0 1250 \n", "1 2706 \n", "2 3443 \n", "3 479 \n", "4 1939 \n", "5 4397 \n", "6 678 \n", "7 1686 \n", "8 477 \n", "9 2906 \n", "10 1609 \n", "11 2333 \n", "12 75 \n", "13 1665 \n", "14 1076 \n", "15 1975 \n", "16 2730 \n", "17 357 \n", "18 1175 \n", "\n", " Estimate!!Total!!Total population!!AGE!!20 to 24 years \\\n", "0 1218 \n", "1 2751 \n", "2 3194 \n", "3 443 \n", "4 1924 \n", "5 4346 \n", "6 901 \n", "7 1709 \n", "8 428 \n", "9 2795 \n", "10 1684 \n", "11 2377 \n", "12 133 \n", "13 1607 \n", "14 1232 \n", "15 2037 \n", "16 2448 \n", "17 411 \n", "18 1147 \n", "\n", " Estimate!!Total!!Total population!!AGE!!25 to 29 years \\\n", "0 1148 \n", "1 2537 \n", "2 3264 \n", "3 302 \n", "4 1624 \n", "5 3621 \n", "6 784 \n", "7 1607 \n", "8 243 \n", "9 2528 \n", "10 1651 \n", "11 2024 \n", "12 116 \n", "13 1368 \n", "14 1077 \n", "15 1832 \n", "16 2078 \n", "17 241 \n", "18 819 \n", "\n", " Estimate!!Total!!Total population!!AGE!!30 to 34 years \\\n", "0 1073 \n", "1 2350 \n", "2 3021 \n", "3 329 \n", "4 1575 \n", "5 3586 \n", "6 574 \n", "7 1570 \n", "8 352 \n", "9 2279 \n", "10 1399 \n", "11 1886 \n", "12 103 \n", "13 1309 \n", "14 1041 \n", "15 1532 \n", "16 2240 \n", "17 406 \n", "18 847 \n", "\n", " Estimate!!Total!!Total population!!AGE!!35 to 39 years \\\n", "0 1121 \n", "1 2715 \n", "2 3296 \n", "3 279 \n", "4 1638 \n", "5 3550 \n", "6 661 \n", "7 1719 \n", "8 382 \n", "9 2790 \n", "10 1653 \n", "11 2366 \n", "12 56 \n", "13 1384 \n", "14 930 \n", "15 1579 \n", "16 2610 \n", "17 232 \n", "18 903 \n", "\n", " Estimate!!Total!!Total population!!AGE!!40 to 44 years \\\n", "0 918 \n", "1 2227 \n", "2 3099 \n", "3 557 \n", "4 1944 \n", "5 4225 \n", "6 687 \n", "7 1766 \n", "8 436 \n", "9 3075 \n", "10 1108 \n", "11 1949 \n", "12 105 \n", "13 1576 \n", "14 1042 \n", "15 1645 \n", "16 3112 \n", "17 280 \n", "18 963 \n", "\n", " Estimate!!Total!!Total population!!AGE!!45 to 49 years \\\n", "0 1147 \n", "1 2749 \n", "2 3070 \n", "3 457 \n", "4 1937 \n", "5 4312 \n", "6 877 \n", "7 1584 \n", "8 477 \n", "9 3050 \n", "10 1477 \n", "11 2320 \n", "12 78 \n", "13 1461 \n", "14 993 \n", "15 1766 \n", "16 2767 \n", "17 403 \n", "18 1058 \n", "\n", " Estimate!!Total!!Total population!!AGE!!50 to 54 years \\\n", "0 1257 \n", "1 2942 \n", "2 3336 \n", "3 483 \n", "4 1910 \n", "5 4021 \n", "6 787 \n", "7 1434 \n", "8 579 \n", "9 2580 \n", "10 1504 \n", "11 2269 \n", "12 105 \n", "13 1424 \n", "14 1178 \n", "15 1985 \n", "16 2477 \n", "17 398 \n", "18 915 \n", "\n", " Estimate!!Total!!Total population!!AGE!!55 to 59 years \\\n", "0 1215 \n", "1 2833 \n", "2 2982 \n", "3 317 \n", "4 1748 \n", "5 3607 \n", "6 558 \n", "7 1195 \n", "8 604 \n", "9 2298 \n", "10 1602 \n", "11 2051 \n", "12 110 \n", "13 1501 \n", "14 1086 \n", "15 1937 \n", "16 2020 \n", "17 332 \n", "18 715 \n", "\n", " Estimate!!Total!!Total population!!AGE!!60 to 64 years \\\n", "0 1093 \n", "1 2810 \n", "2 3323 \n", "3 555 \n", "4 1744 \n", "5 3903 \n", "6 533 \n", "7 1521 \n", "8 365 \n", "9 2596 \n", "10 1413 \n", "11 2102 \n", "12 138 \n", "13 1432 \n", "14 1305 \n", "15 2080 \n", "16 2349 \n", "17 423 \n", "18 876 \n", "\n", " Estimate!!Total!!Total population!!AGE!!65 to 69 years \\\n", "0 1080 \n", "1 2181 \n", "2 3306 \n", "3 345 \n", "4 1629 \n", "5 3812 \n", "6 753 \n", "7 1390 \n", "8 519 \n", "9 2652 \n", "10 1103 \n", "11 2022 \n", "12 114 \n", "13 1329 \n", "14 991 \n", "15 1717 \n", "16 2258 \n", "17 502 \n", "18 851 \n", "\n", " Estimate!!Total!!Total population!!AGE!!70 to 74 years \\\n", "0 616 \n", "1 1753 \n", "2 2489 \n", "3 303 \n", "4 1400 \n", "5 3038 \n", "6 712 \n", "7 1117 \n", "8 588 \n", "9 2073 \n", "10 1079 \n", "11 1540 \n", "12 87 \n", "13 1432 \n", "14 772 \n", "15 1358 \n", "16 1405 \n", "17 357 \n", "18 692 \n", "\n", " Estimate!!Total!!Total population!!AGE!!75 to 79 years \\\n", "0 524 \n", "1 991 \n", "2 1763 \n", "3 171 \n", "4 1140 \n", "5 2551 \n", "6 245 \n", "7 814 \n", "8 371 \n", "9 1777 \n", "10 806 \n", "11 1109 \n", "12 68 \n", "13 1056 \n", "14 552 \n", "15 1092 \n", "16 913 \n", "17 285 \n", "18 460 \n", "\n", " Estimate!!Total!!Total population!!AGE!!80 to 84 years \\\n", "0 349 \n", "1 789 \n", "2 1315 \n", "3 74 \n", "4 576 \n", "5 1433 \n", "6 326 \n", "7 450 \n", "8 334 \n", "9 1184 \n", "10 363 \n", "11 410 \n", "12 68 \n", "13 527 \n", "14 497 \n", "15 494 \n", "16 843 \n", "17 117 \n", "18 330 \n", "\n", " Estimate!!Total!!Total population!!AGE!!85 years and over \n", "0 328 \n", "1 648 \n", "2 727 \n", "3 174 \n", "4 365 \n", "5 1578 \n", "6 320 \n", "7 503 \n", "8 241 \n", "9 737 \n", "10 225 \n", "11 742 \n", "12 57 \n", "13 501 \n", "14 242 \n", "15 638 \n", "16 497 \n", "17 172 \n", "18 240 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df1.replace('ZCTA5 ','', regex=True) \n", "\n", "df2" ] }, { "cell_type": "markdown", "id": "1031ec69-0831-4769-9312-888da1fb26bd", "metadata": {}, "source": [ "Rename the column names to the standard names using rename function." ] }, { "cell_type": "code", "execution_count": 5, "id": "f14aac09-e55e-4a29-bd31-8c965c51cec4", "metadata": {}, "outputs": [], "source": [ "df2.rename(columns={'Geographic Area Name':'ZCTA5','Estimate!!Total!!Total population':'total_population','Estimate!!Total!!Total population!!AGE!!Under 5 years':'age_5under','Estimate!!Total!!Total population!!AGE!!5 to 9 years':'age_5to9y','Estimate!!Total!!Total population!!AGE!!10 to 14 years':'age_10to14y','Estimate!!Total!!Total population!!AGE!!15 to 19 years':'age_15to19y'}, inplace=True)\n", "\n", "df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!20 to 24 years':'age_20to24y','Estimate!!Total!!Total population!!AGE!!25 to 29 years':'age_25to29y','Estimate!!Total!!Total population!!AGE!!30 to 34 years':'age_30to34y','Estimate!!Total!!Total population!!AGE!!35 to 39 years':'age_35to39y','Estimate!!Total!!Total population!!AGE!!40 to 44 years':'age_40to44y','Estimate!!Total!!Total population!!AGE!!45 to 49 years':'age_45to49y'}, inplace=True)\n", "\n", "df2.rename(columns={'Estimate!!Total!!Total population!!AGE!!50 to 54 years':'age_50to54y','Estimate!!Total!!Total population!!AGE!!55 to 59 years':'age_55to59y','Estimate!!Total!!Total population!!AGE!!60 to 64 years':'age_60to64y','Estimate!!Total!!Total population!!AGE!!65 to 69 years':'age_65to69y','Estimate!!Total!!Total population!!AGE!!70 to 74 years':'age_70to74y','Estimate!!Total!!Total population!!AGE!!75 to 79 years':'age_75to79y','Estimate!!Total!!Total population!!AGE!!80 to 84 years':'age_80to84y','Estimate!!Total!!Total population!!AGE!!85 years and over':'age_above85y'}, inplace=True)" ] }, { "cell_type": "markdown", "id": "0b5e137c-6cb9-4b02-8426-ba5a7a090e63", "metadata": {}, "source": [ "Drop the duplicates from the dataframe using drop_duplicates function" ] }, { "cell_type": "code", "execution_count": 6, "id": "dd592397-ec50-40b8-8297-278ae6f35586", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5total_populationage_5underage_5to9yage_10to14yage_15to19yage_20to24yage_25to29yage_30to34yage_35to39yage_40to44yage_45to49yage_50to54yage_55to59yage_60to64yage_65to69yage_70to74yage_75to79yage_80to84yage_above85y
00060117599938104612781250121811481073112191811471257121510931080616524349328
10060239209179423202113270627512537235027152227274929422833281021811753991789648
2006035013524692778326034433194326430213296309930703336298233233306248917631315727
300606630431230442047944330232927955745748331755534530317174174
400610275901256134618951939192416241575163819441937191017481744162914001140576365
50061262566326030774249439743463621358635504225431240213607390338123038255114331578
60061610687469417405678901784574661687877787558533753712245326320
70061724508134014011702168617091607157017191766158414341195152113901117814450503
8006227405224296489477428243352382436477579604365519588371334241
9006234232119452599245729062795252822792790307530502580229825962652207317771184737
100062423238139616041562160916841651139916531108147715041602141311031079806363225
1100627329361557177021092333237720241886236619492320226920512102202215401109410742
1200631155534545475133116103561057810511013811487686857
1300637236101210119916291665160713681309138415761461142415011432132914321056527501
1400638172021025117099310761232107710419301042993117810861305991772552497242
1500641283321359154617601975203718321532157916451766198519372080171713581092494638
160064638008203325702658273024482078224026103112276724772020234922581405913843497
17006475610226204264357411241406232280403398332423502357285117172
180065014664848886939117511478198479039631058915715876851692460330240
\n", "
" ], "text/plain": [ " ZCTA5 total_population age_5under age_5to9y age_10to14y age_15to19y \\\n", "0 00601 17599 938 1046 1278 1250 \n", "1 00602 39209 1794 2320 2113 2706 \n", "2 00603 50135 2469 2778 3260 3443 \n", "3 00606 6304 312 304 420 479 \n", "4 00610 27590 1256 1346 1895 1939 \n", "5 00612 62566 3260 3077 4249 4397 \n", "6 00616 10687 469 417 405 678 \n", "7 00617 24508 1340 1401 1702 1686 \n", "8 00622 7405 224 296 489 477 \n", "9 00623 42321 1945 2599 2457 2906 \n", "10 00624 23238 1396 1604 1562 1609 \n", "11 00627 32936 1557 1770 2109 2333 \n", "12 00631 1555 34 54 54 75 \n", "13 00637 23610 1210 1199 1629 1665 \n", "14 00638 17202 1025 1170 993 1076 \n", "15 00641 28332 1359 1546 1760 1975 \n", "16 00646 38008 2033 2570 2658 2730 \n", "17 00647 5610 226 204 264 357 \n", "18 00650 14664 848 886 939 1175 \n", "\n", " age_20to24y age_25to29y age_30to34y age_35to39y age_40to44y \\\n", "0 1218 1148 1073 1121 918 \n", "1 2751 2537 2350 2715 2227 \n", "2 3194 3264 3021 3296 3099 \n", "3 443 302 329 279 557 \n", "4 1924 1624 1575 1638 1944 \n", "5 4346 3621 3586 3550 4225 \n", "6 901 784 574 661 687 \n", "7 1709 1607 1570 1719 1766 \n", "8 428 243 352 382 436 \n", "9 2795 2528 2279 2790 3075 \n", "10 1684 1651 1399 1653 1108 \n", "11 2377 2024 1886 2366 1949 \n", "12 133 116 103 56 105 \n", "13 1607 1368 1309 1384 1576 \n", "14 1232 1077 1041 930 1042 \n", "15 2037 1832 1532 1579 1645 \n", "16 2448 2078 2240 2610 3112 \n", "17 411 241 406 232 280 \n", "18 1147 819 847 903 963 \n", "\n", " age_45to49y age_50to54y age_55to59y age_60to64y age_65to69y \\\n", "0 1147 1257 1215 1093 1080 \n", "1 2749 2942 2833 2810 2181 \n", "2 3070 3336 2982 3323 3306 \n", "3 457 483 317 555 345 \n", "4 1937 1910 1748 1744 1629 \n", "5 4312 4021 3607 3903 3812 \n", "6 877 787 558 533 753 \n", "7 1584 1434 1195 1521 1390 \n", "8 477 579 604 365 519 \n", "9 3050 2580 2298 2596 2652 \n", "10 1477 1504 1602 1413 1103 \n", "11 2320 2269 2051 2102 2022 \n", "12 78 105 110 138 114 \n", "13 1461 1424 1501 1432 1329 \n", "14 993 1178 1086 1305 991 \n", "15 1766 1985 1937 2080 1717 \n", "16 2767 2477 2020 2349 2258 \n", "17 403 398 332 423 502 \n", "18 1058 915 715 876 851 \n", "\n", " age_70to74y age_75to79y age_80to84y age_above85y \n", "0 616 524 349 328 \n", "1 1753 991 789 648 \n", "2 2489 1763 1315 727 \n", "3 303 171 74 174 \n", "4 1400 1140 576 365 \n", "5 3038 2551 1433 1578 \n", "6 712 245 326 320 \n", "7 1117 814 450 503 \n", "8 588 371 334 241 \n", "9 2073 1777 1184 737 \n", "10 1079 806 363 225 \n", "11 1540 1109 410 742 \n", "12 87 68 68 57 \n", "13 1432 1056 527 501 \n", "14 772 552 497 242 \n", "15 1358 1092 494 638 \n", "16 1405 913 843 497 \n", "17 357 285 117 172 \n", "18 692 460 330 240 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = df2.drop_duplicates()\n", "\n", "df3" ] }, { "cell_type": "markdown", "id": "fe8c6a6d-5020-4dd8-9481-ab9665e17f8b", "metadata": {}, "source": [ "Add new column to the dataframe using insert function for the year which we are loading data." ] }, { "cell_type": "code", "execution_count": 7, "id": "8442ce68-2858-4d4a-9d41-331f4323d1eb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5Yeartotal_populationage_5underage_5to9yage_10to14yage_15to19yage_20to24yage_25to29yage_30to34y...age_40to44yage_45to49yage_50to54yage_55to59yage_60to64yage_65to69yage_70to74yage_75to79yage_80to84yage_above85y
000601202017599938104612781250121811481073...91811471257121510931080616524349328
1006022020392091794232021132706275125372350...2227274929422833281021811753991789648
2006032020501352469277832603443319432643021...309930703336298233233306248917631315727
30060620206304312304420479443302329...55745748331755534530317174174
4006102020275901256134618951939192416241575...19441937191017481744162914001140576365
5006122020625663260307742494397434636213586...4225431240213607390338123038255114331578
600616202010687469417405678901784574...687877787558533753712245326320
7006172020245081340140117021686170916071570...1766158414341195152113901117814450503
80062220207405224296489477428243352...436477579604365519588371334241
9006232020423211945259924572906279525282279...307530502580229825962652207317771184737
10006242020232381396160415621609168416511399...1108147715041602141311031079806363225
11006272020329361557177021092333237720241886...19492320226920512102202215401109410742
12006312020155534545475133116103...1057810511013811487686857
13006372020236101210119916291665160713681309...15761461142415011432132914321056527501
1400638202017202102511709931076123210771041...1042993117810861305991772552497242
15006412020283321359154617601975203718321532...16451766198519372080171713581092494638
16006462020380082033257026582730244820782240...3112276724772020234922581405913843497
170064720205610226204264357411241406...280403398332423502357285117172
180065020201466484888693911751147819847...9631058915715876851692460330240
\n", "

19 rows × 21 columns

\n", "
" ], "text/plain": [ " ZCTA5 Year total_population age_5under age_5to9y age_10to14y \\\n", "0 00601 2020 17599 938 1046 1278 \n", "1 00602 2020 39209 1794 2320 2113 \n", "2 00603 2020 50135 2469 2778 3260 \n", "3 00606 2020 6304 312 304 420 \n", "4 00610 2020 27590 1256 1346 1895 \n", "5 00612 2020 62566 3260 3077 4249 \n", "6 00616 2020 10687 469 417 405 \n", "7 00617 2020 24508 1340 1401 1702 \n", "8 00622 2020 7405 224 296 489 \n", "9 00623 2020 42321 1945 2599 2457 \n", "10 00624 2020 23238 1396 1604 1562 \n", "11 00627 2020 32936 1557 1770 2109 \n", "12 00631 2020 1555 34 54 54 \n", "13 00637 2020 23610 1210 1199 1629 \n", "14 00638 2020 17202 1025 1170 993 \n", "15 00641 2020 28332 1359 1546 1760 \n", "16 00646 2020 38008 2033 2570 2658 \n", "17 00647 2020 5610 226 204 264 \n", "18 00650 2020 14664 848 886 939 \n", "\n", " age_15to19y age_20to24y age_25to29y age_30to34y ... age_40to44y \\\n", "0 1250 1218 1148 1073 ... 918 \n", "1 2706 2751 2537 2350 ... 2227 \n", "2 3443 3194 3264 3021 ... 3099 \n", "3 479 443 302 329 ... 557 \n", "4 1939 1924 1624 1575 ... 1944 \n", "5 4397 4346 3621 3586 ... 4225 \n", "6 678 901 784 574 ... 687 \n", "7 1686 1709 1607 1570 ... 1766 \n", "8 477 428 243 352 ... 436 \n", "9 2906 2795 2528 2279 ... 3075 \n", "10 1609 1684 1651 1399 ... 1108 \n", "11 2333 2377 2024 1886 ... 1949 \n", "12 75 133 116 103 ... 105 \n", "13 1665 1607 1368 1309 ... 1576 \n", "14 1076 1232 1077 1041 ... 1042 \n", "15 1975 2037 1832 1532 ... 1645 \n", "16 2730 2448 2078 2240 ... 3112 \n", "17 357 411 241 406 ... 280 \n", "18 1175 1147 819 847 ... 963 \n", "\n", " age_45to49y age_50to54y age_55to59y age_60to64y age_65to69y \\\n", "0 1147 1257 1215 1093 1080 \n", "1 2749 2942 2833 2810 2181 \n", "2 3070 3336 2982 3323 3306 \n", "3 457 483 317 555 345 \n", "4 1937 1910 1748 1744 1629 \n", "5 4312 4021 3607 3903 3812 \n", "6 877 787 558 533 753 \n", "7 1584 1434 1195 1521 1390 \n", "8 477 579 604 365 519 \n", "9 3050 2580 2298 2596 2652 \n", "10 1477 1504 1602 1413 1103 \n", "11 2320 2269 2051 2102 2022 \n", "12 78 105 110 138 114 \n", "13 1461 1424 1501 1432 1329 \n", "14 993 1178 1086 1305 991 \n", "15 1766 1985 1937 2080 1717 \n", "16 2767 2477 2020 2349 2258 \n", "17 403 398 332 423 502 \n", "18 1058 915 715 876 851 \n", "\n", " age_70to74y age_75to79y age_80to84y age_above85y \n", "0 616 524 349 328 \n", "1 1753 991 789 648 \n", "2 2489 1763 1315 727 \n", "3 303 171 74 174 \n", "4 1400 1140 576 365 \n", "5 3038 2551 1433 1578 \n", "6 712 245 326 320 \n", "7 1117 814 450 503 \n", "8 588 371 334 241 \n", "9 2073 1777 1184 737 \n", "10 1079 806 363 225 \n", "11 1540 1109 410 742 \n", "12 87 68 68 57 \n", "13 1432 1056 527 501 \n", "14 772 552 497 242 \n", "15 1358 1092 494 638 \n", "16 1405 913 843 497 \n", "17 357 285 117 172 \n", "18 692 460 330 240 \n", "\n", "[19 rows x 21 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3.insert(1, \"Year\", \"2020\", True)\n", "df3" ] }, { "cell_type": "markdown", "id": "bb001b45-3588-411a-8721-83af265e6b32", "metadata": {}, "source": [ "If there are files for each year then oncatenate the data for all these years using concat function. Here I just used df3 as I showed only for one year, similarly we can create multiple dataframes for each year and concatenate using the below function" ] }, { "cell_type": "code", "execution_count": null, "id": "e254361f-15aa-440b-a597-651bd56263bc", "metadata": {}, "outputs": [], "source": [ "dataframes = [df3]\n", " \n", "result = pd.concat(dataframes)" ] }, { "cell_type": "markdown", "id": "80e41a62-5de2-450a-a821-211d4ff59de3", "metadata": {}, "source": [ "### Load the result to the sql server database\n", "\n", "Now, connect to the sql and load the dataframe as new table using SQLAlchemy. It is a sql toolkit to connect and perform all the necessary sql operations from python." ] }, { "cell_type": "markdown", "id": "05914398-d17b-4c4a-aba1-2b3a761577aa", "metadata": {}, "source": [ "svr is the sql server name
\n", "uid is the user id
\n", "pwd is the password
\n", "db is the database which is USZIPCODE
\n", "create_engine will establish python jdbc to the sql
" ] }, { "cell_type": "code", "execution_count": null, "id": "31ea1c6d-c6e7-4c38-b6e0-b6b71f8190b4", "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import create_engine\n", "import pandas as pd\n", "svr = 'essql1.walton.uark.edu'\n", "uid = 'XXXXXX'\n", "pwd = 'XXXXXX'\n", "db = 'USZIPCODE'\n", "# Create a connection using SQL Alchemy engine\n", "engine = create_engine(f\"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes\", fast_executemany=True)\n", "# Using pandas, leverage the SQL Alchemy engine that we created above\n", "result.to_sql('AGE',engine, if_exists='replace')" ] }, { "cell_type": "markdown", "id": "0f16a50a-a4aa-4148-99cd-93593a4c186a", "metadata": {}, "source": [ "The above peice of code will create a new table called 'AGE' in the 'USZIPCODE' database. The table will have all the columns which are present in the results dataframe. " ] }, { "cell_type": "markdown", "id": "cca6efd4-69f5-4da2-847a-ef1bde4b1ac7", "metadata": {}, "source": [ "We can validate the data by logging into the sql server directly or we can read the table using read_sql as shown below:" ] }, { "cell_type": "code", "execution_count": null, "id": "cd01b8a8-3304-4173-9ce9-e3b5a2b8aa36", "metadata": {}, "outputs": [], "source": [ "df = pd.read_sql('select * from AGE', engine)\n", "print('***Age data for the ZCTA5***')\n", "print(df)" ] }, { "cell_type": "markdown", "id": "986f8875-aa88-42f2-9883-e6b211ed8e44", "metadata": {}, "source": [ "## Use cases\n", "There are many use cases which can be derived from the age group dataset. Below are few use cases. \n", "For all the use cases below, please update the following fields in the below python code before executing\n", "\n", "1) uid - user id\n", "2) pwd - password\n", "3) SQL Query - the sql query which needs to be executed\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a105fc46-6386-4778-8636-751e11cc220b", "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import create_engine\n", "import pandas as pd\n", "svr = 'essql1.walton.uark.edu'\n", "uid = 'xxx'\n", "pwd = 'xxx'\n", "db = 'USZIPCODE'\n", "# Create a connection using SQL Alchemy engine\n", "engine = create_engine(f\"mssql+pyodbc://{uid}:{pwd}@{svr}/{db}?driver=SQL+Server+Native+Client+11.0&TrustServerCertificate=yes\", fast_executemany=True)\n", "# Using pandas, leverage the SQL Alchemy engine that we created above\n", "df = pd.read_sql('SQL Query', engine)\n", "print(df)" ] }, { "cell_type": "markdown", "id": "fc721edd-d998-4972-bdf6-fecc6083a159", "metadata": {}, "source": [ "### Use case 1: \n", "Find the total population of all age groups across all years and ZCTA5\n", "```sql\n", "select year, sum(total_population) as total_population \n", "from [USZIPCODE].[dbo].[AGE] \n", "group by year;\n", "```" ] }, { "cell_type": "markdown", "id": "d626f8ad-67df-4baf-8a98-b6fe3fadf1e2", "metadata": {}, "source": [ "### Use case 2:\n", "Top five ZCTA5 codes where the population is highest\n", "```sql\n", "select top(5) zcta5,total_population,year \n", "from [USZIPCODE].[dbo].[AGE] \n", "where year=2020 \n", "order by total_population desc;\n", "```" ] }, { "cell_type": "markdown", "id": "293ffbfc-7ccb-41eb-90ae-b7ad0ebf7adb", "metadata": {}, "source": [ "### Use case 3:\n", "ZCTA5 with the highest teenage population for a given year\n", "\n", "```sql\n", "select top(1) ZCTA5, total \n", "from ( select ZCTA5,sum(age_10to14y + age_15to19y) as total \n", " from [USZIPCODE].[dbo].[AGE]\n", " where year=2020 \n", " group by ZCTA5)\n", "order by total desc;\n", "```" ] }, { "cell_type": "markdown", "id": "190e030f-0185-4275-9a0b-3d5eaec05a02", "metadata": {}, "source": [ "### Use case 4:\n", "Population growth across years\n", "\n", "```sql\n", "select year,total_population, total_population-LAG(total_population) OVER (ORDER BY year) AS population_diff \n", "from ( select year, sum(total_population) as total_population \n", " from [USZIPCODE].[dbo].[AGE] \n", " group by year);\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" } } }, "nbformat": 4, "nbformat_minor": 5 }