Working with data#

The purpose of this lesson is to expose students to libraries for preparing and manipulating “rectangular” data files (that is data which has both rows and columns, where each row has the same number of columns).

Tip

A python library is a collection of related modules containing bundles of codes that can be used in different programs. Some smart folks wrote useful blocks of codes that solves specific problems, and saved it in a python file (module). You will be introduced to some relevant libraries used in data analysis.

This notebook covers a few functions that are available with the pandas library. There is an additional notebook which goes into several of the handiest pandas functions here

The next cell is one that will appear in some configuration as the first in nearly every notebook. It imports the key libraries we are going to use in our analysis and model building. In the first case, we will depend on pandas and numpy for our data manipulation and we’ll leverage matplotlib as our graphical library. We’ll also use the seaborn library to show off a few plots and visuals that are not quite as readily accessible with the matplotlib library.

# Setup code
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import sys
sys.path.append('..')
from src.data import load_data, convert_to_bool

# Limit the precision to 4 significant digits
pd.set_option('display.precision',4)

Importing the dataset#

Boston Housing Dataset Let’s take a look at some basic data manipulation with pandas and understand how to get some data to work with. In all of our examples, we’ll use a pretty standard text format called (CSV) or comma-separated-values files. This format is readable by nearly every statistical software package and by humans. The first row is typically the name of the columns and each line of the file is a row of data with the values separated by commas. The pandas library supports many different ways to load up a dataframe, which we will use as the primary mechanism for manipulating data in these notebooks.

This particular file is available in the data directory BostonHousing.

Business Context Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

  • CRIM: per capita crime rate by town

  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS: proportion of non-retail business acres per town

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX: nitric oxides concentration (parts per 10 million)

  • RM: average number of rooms per dwelling

  • AGE: proportion of owner-occupied units built prior to 1940

  • DIS: weighted distances to five Boston employment centers

  • RAD: index of accessibility to radial highways

  • TAX: full-value property-tax rate per 10,000

  • PTRATIO: pupil-teacher ratio by town

  • B: 1000(Bk−0.63)2 where Bk is the proportion of blacks by town

  • LSTAT: % lower status of the population

  • MEDV: Median value of owner-occupied homes in 1000s

  • CAT. MEDV: Is median value of owner-occupied home in tract above $30k (CAT.MEDV = 1) or not (CAT.MEDV = 0)

We can see that the input attributes have a mixture of units.

# Load up the housing dataset
housing_df = load_data('BostonHousing')
# Change the column to be more convenient (notice the space between . MEDV)
housing_df.rename(columns={'CAT. MEDV':'CAT_MEDV'},inplace=True)
# convert the categorical variable to boolean
convert_to_bool(housing_df, 'CAT_MEDV',inplace=True)
# Print the first 5 rows and the last 5 rows
housing_df
/workspaces/py4a/book/data/BostonHousing.csv
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT_MEDV
0 0.0063 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0 False
1 0.0273 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 False
2 0.0273 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 True
3 0.0324 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 True
4 0.0691 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.0626 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 9.67 22.4 False
502 0.0453 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 9.08 20.6 False
503 0.0608 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 5.64 23.9 False
504 0.1096 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 6.48 22.0 False
505 0.0474 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 7.88 11.9 False

506 rows × 14 columns

Data Inspection#

One of the first things we want to do in our process is to take a look at the data we have and see what kinds of issues we might be dealing with. For simple datasets, this can be a quick glance at a table of data, for move complex datasets or issues it will be helpful to use some kind of graphical analysis.

In the example above we see just 10 rows of the data (the first 5 and the last 5). We can also inspect a few more from the front and a few more in the back using head() and tail().

housing_df.head(10)
housing_df.tail(10)

We can clearly see that the dataset has 505 rows ( the number on the side is the index and shows us such) and we have already been told how many columns we have. But we can also use a few handy features to get this information from our dataset.

# Shape tells us the number of rows and columns
housing_df.shape

We see from the ouput that we have 506 rows and 14 columns, but we can’t see all the columns - let’s check out the column names and get an idea of some of the descriptive statistics for each numerical column)

housing_df.describe()

We can see from our data dictionary, provided above, that CAT_MEDV is meant to be a categorical value (boolean)- not a numeric value. So the descriptive statistics for it don’t make much sense. We can see however that there are 2 unique values (good, True/False) that False is the most common and it occurs 422 out of of 506 times.

Quick plots and charts#

We’ve got some interesting data here and we can get some quick plots to see how the data is distributed. For instance, we might be interested in how old the houses are or what the crime stats are like. With this we can use the built-in dataframe functions for plotting.

# It's always safe to use the `index` method to get a particular column
housing_df['AGE'].hist()
housing_df['CRIM'].hist()

We can also use the handy seaborn library which gives us a bit more control over the output. There is lots and lots of examples of using the seaborn library on the website with a great tutorial.

sns.histplot(data=housing_df,x='INDUS',color='yellow').set_title('proportion of non-retail business acres per town')

Determining the interaction between a set of variables#

Sometimes it is helpful to see many more dimensions of the data at once. We can use color, size, shape and axises to show several dimensions, and one more commonly overlooked approach is to use faceting as yet another dimension. Let’s take a look at how the age of the houses vary by the relativeness to the Charles river.

From here we can see that, as previously, there are many more homes not on the Charles River and also that the age of the homes is skewed heavily toward the older home ages.

# Show a scatterplot relating LSTAT to MEDV
housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False)
sns.boxplot(data=housing_df,x='CHAS',y='LSTAT')

We may also be interested in the relationship between a set of the variables so that we can identify which ones may prove to be over-influencing a regression model. For this we can use two approaches, first we’ll look at a set of charts that are related in a pair-wise chart or a correlation map. First a pairwise graph. A pairwise graph shows the relationship between these 5 different variables to each other in one simple clean chart.

df = housing_df[['CRIM','INDUS','LSTAT',"RM","MEDV"]]
sns.pairplot(df)

Data manipulation#

Visualization is helpful and can be really useful in Python in lots of situations, especially if you need something super specific that can’t be done using other tools - or if you just need a quick visual of what’s happening. The real power unleashed by coding is to manipulate the data in a series of steps that can be repeated over and over again. This is where pandas and Python shine.

Where to get more help#

First thing we need to know about dataframes is how they are accessed. In other words, how do we get at specific rows/columns in the data. Below are a few indexing techniques.

Insurance Dataset The following dataset represents individuals and their health insurances charges from a US company. Some of the key indicators that influence the cost of the insurance are in this dataset.

  • age: age of primary beneficiary

  • sex: insurance contractor gender, female, male

  • bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9

  • children: Number of children covered by health insurance / Number of dependents

  • smoker: Smoking

  • region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.

  • charges: Individual medical costs billed by health insurance

# First get the data and take a quick look at the set
insurance_df = load_data('insurance')
insurance_df
# Gather the first row
insurance_df.iloc[0]
# The rows between 10-20
insurance_df.iloc[10:20]

Pandas Series#

With dataframes, the columns of data are represented as a collection of series.

insurance_df.columns

We can look at the data in single series

# All the ages
insurance_df['age']
# but if we want more than one column (series) we need to specify a list
list_of_columns = ['age','sex']
insurance_df[list_of_columns]
# More commonly we just use the list directly
insurance_df[['age','sex']]

We can use this for lots of great stuff like aggregating values in a column.

# What is the maximum charges
insurance_df['charges'].max()

Filtering#

Filtering is a bit more complex. What we need is a boolean array of values to index our dataframe by. I’ll leave the explanation to the experts, but show a few examples.

# All the smokers
insurance_df[insurance_df['smoker']=='yes']
# Or using the query syntax
insurance_df.query('smoker=="yes"')
# and now, just the male smokers
insurance_df.query('sex=="male" & smoker=="yes"')
# We can also group our queries and then apply aggregates
# In this example, we are grouping by 'sex' and then using only the 'bmi' series, getting the mean()
insurance_df.groupby('sex')['bmi'].mean()
# Or by two categories
# Here we group by two columns and still just looking at bmi
insurance_df.groupby(['sex','smoker'])['bmi'].mean()

Working on data in a series#

The library is incredibly powerful, but just a few things we want to do. Let’s take a few examples.

A few things to note

  • all operations apply to all the items in the series by default, so no need to loop

  • while we can change values in a series, more commonly we’ll just create a new series and replace the old one (in one fellswoop)

  • notice that what we are really doing is creating a new series anyway (below, we are rounding the numbers, but this creates a new series - unconnected to our dataframe)

# Create a new column that just says whether someone has children
# If the number of children is > 0 then True else False
insurance_df['has_children'] = insurance_df['children'] > 0
insurance_df
# Now, let's round the amount of the charges to 2 decimals
insurance_df['charges'] = round(insurance_df['charges'],2)
insurance_df['charges']
# assuming these charges are an annual rate, let's create a column of the monthly rate
insurance_df['monthly_charges']=round(insurance_df['charges']/12,2)
insurance_df

Working with strings in the series#

Working with strings in a series are a little different, but they look alot like the python string functions. We need to tell pandas what the type of data we want to work with is

# Working with a string, ensure all the male and female strings are UPPERCASED
insurance_df['gender']=insurance_df['sex'].str.upper()
insurance_df
# or we could uppercase just the first letter of the region for instance
insurance_df['region'] = insurance_df['region'].str.title()
insurance_df