Working with data#

The purpose of this lesson is to expose students to libraries for preparing and manipulating “rectangular” data files (that is data which has both rows and columns, where each row has the same number of columns).

Tip

A python library is a collection of related modules containing bundles of codes that can be used in different programs. Some smart folks wrote useful blocks of codes that solves specific problems, and saved it in a python file (module). You will be introduced to some relevant libraries used in data analysis.

This notebook covers a few functions that are available with the pandas library. There is an additional notebook which goes into several of the handiest pandas functions here

The next cell is one that will appear in some configuration as the first in nearly every notebook. It imports the key libraries we are going to use in our analysis and model building. In the first case, we will depend on pandas and numpy for our data manipulation and we’ll leverage matplotlib as our graphical library. We’ll also use the seaborn library to show off a few plots and visuals that are not quite as readily accessible with the matplotlib library.

# Setup code
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import sys
sys.path.append('..')
from src.data import load_data, convert_to_bool

# Limit the precision to 4 significant digits
pd.set_option('display.precision',4)

Importing the dataset#

Boston Housing Dataset Let’s take a look at some basic data manipulation with pandas and understand how to get some data to work with. In all of our examples, we’ll use a pretty standard text format called (CSV) or comma-separated-values files. This format is readable by nearly every statistical software package and by humans. The first row is typically the name of the columns and each line of the file is a row of data with the values separated by commas. The pandas library supports many different ways to load up a dataframe, which we will use as the primary mechanism for manipulating data in these notebooks.

This particular file is available in the data directory BostonHousing.

Business Context Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are deﬁned as follows (taken from the UCI Machine Learning Repository):

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to ﬁve Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per 10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk−0.63)2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in 1000s
CAT. MEDV: Is median value of owner-occupied home in tract above $30k (CAT.MEDV = 1) or not (CAT.MEDV = 0)

We can see that the input attributes have a mixture of units.

# Load up the housing dataset
housing_df = load_data('BostonHousing')
# Change the column to be more convenient (notice the space between . MEDV)
housing_df.rename(columns={'CAT. MEDV':'CAT_MEDV'},inplace=True)
# convert the categorical variable to boolean
convert_to_bool(housing_df, 'CAT_MEDV',inplace=True)
# Print the first 5 rows and the last 5 rows
housing_df

/workspaces/py4a/book/data/BostonHousing.csv

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	LSTAT	MEDV	CAT_MEDV
0	0.0063	18.0	2.31	0	0.538	6.575	65.2	4.0900	1	296	15.3	4.98	24.0	False
1	0.0273	0.0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	9.14	21.6	False
2	0.0273	0.0	7.07	0	0.469	7.185	61.1	4.9671	2	242	17.8	4.03	34.7	True
3	0.0324	0.0	2.18	0	0.458	6.998	45.8	6.0622	3	222	18.7	2.94	33.4	True
4	0.0691	0.0	2.18	0	0.458	7.147	54.2	6.0622	3	222	18.7	5.33	36.2	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
501	0.0626	0.0	11.93	0	0.573	6.593	69.1	2.4786	1	273	21.0	9.67	22.4	False
502	0.0453	0.0	11.93	0	0.573	6.120	76.7	2.2875	1	273	21.0	9.08	20.6	False
503	0.0608	0.0	11.93	0	0.573	6.976	91.0	2.1675	1	273	21.0	5.64	23.9	False
504	0.1096	0.0	11.93	0	0.573	6.794	89.3	2.3889	1	273	21.0	6.48	22.0	False
505	0.0474	0.0	11.93	0	0.573	6.030	80.8	2.5050	1	273	21.0	7.88	11.9	False

506 rows × 14 columns

Data Inspection#

One of the first things we want to do in our process is to take a look at the data we have and see what kinds of issues we might be dealing with. For simple datasets, this can be a quick glance at a table of data, for move complex datasets or issues it will be helpful to use some kind of graphical analysis.

In the example above we see just 10 rows of the data (the first 5 and the last 5). We can also inspect a few more from the front and a few more in the back using head() and tail().

housing_df.head(10)

housing_df.tail(10)

We can clearly see that the dataset has 505 rows ( the number on the side is the index and shows us such) and we have already been told how many columns we have. But we can also use a few handy features to get this information from our dataset.

# Shape tells us the number of rows and columns
housing_df.shape

We see from the ouput that we have 506 rows and 14 columns, but we can’t see all the columns - let’s check out the column names and get an idea of some of the descriptive statistics for each numerical column)

housing_df.describe()

We can see from our data dictionary, provided above, that CAT_MEDV is meant to be a categorical value (boolean)- not a numeric value. So the descriptive statistics for it don’t make much sense. We can see however that there are 2 unique values (good, True/False) that False is the most common and it occurs 422 out of of 506 times.

Quick plots and charts#

We’ve got some interesting data here and we can get some quick plots to see how the data is distributed. For instance, we might be interested in how old the houses are or what the crime stats are like. With this we can use the built-in dataframe functions for plotting.

# It's always safe to use the `index` method to get a particular column
housing_df['AGE'].hist()

housing_df['CRIM'].hist()

We can also use the handy seaborn library which gives us a bit more control over the output. There is lots and lots of examples of using the seaborn library on the website with a great tutorial.

sns.histplot(data=housing_df,x='INDUS',color='yellow').set_title('proportion of non-retail business acres per town')

Determining the interaction between a set of variables#

Sometimes it is helpful to see many more dimensions of the data at once. We can use color, size, shape and axises to show several dimensions, and one more commonly overlooked approach is to use faceting as yet another dimension. Let’s take a look at how the age of the houses vary by the relativeness to the Charles river.

From here we can see that, as previously, there are many more homes not on the Charles River and also that the age of the homes is skewed heavily toward the older home ages.

# Show a scatterplot relating LSTAT to MEDV
housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False)

sns.boxplot(data=housing_df,x='CHAS',y='LSTAT')

We may also be interested in the relationship between a set of the variables so that we can identify which ones may prove to be over-influencing a regression model. For this we can use two approaches, first we’ll look at a set of charts that are related in a pair-wise chart or a correlation map. First a pairwise graph. A pairwise graph shows the relationship between these 5 different variables to each other in one simple clean chart.

df = housing_df[['CRIM','INDUS','LSTAT',"RM","MEDV"]]
sns.pairplot(df)

Data manipulation#

Visualization is helpful and can be really useful in Python in lots of situations, especially if you need something super specific that can’t be done using other tools - or if you just need a quick visual of what’s happening. The real power unleashed by coding is to manipulate the data in a series of steps that can be repeated over and over again. This is where pandas and Python shine.

Where to get more help#

The pandas documentation is very good
The 10 minute guide to pandas is a great place to start
Pandas comparison to spreadsheets
There are alot of examples in the pandas cookbook, but if you can’t find it there - try stackoverflow

First thing we need to know about dataframes is how they are accessed. In other words, how do we get at specific rows/columns in the data. Below are a few indexing techniques.

Insurance Dataset The following dataset represents individuals and their health insurances charges from a US company. Some of the key indicators that influence the cost of the insurance are in this dataset.

age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: Smoking
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance

# First get the data and take a quick look at the set
insurance_df = load_data('insurance')
insurance_df

# Gather the first row
insurance_df.iloc[0]

# The rows between 10-20
insurance_df.iloc[10:20]

Pandas Series#

With dataframes, the columns of data are represented as a collection of series.

insurance_df.columns

We can look at the data in single series

# All the ages
insurance_df['age']

# but if we want more than one column (series) we need to specify a list
list_of_columns = ['age','sex']
insurance_df[list_of_columns]

# More commonly we just use the list directly
insurance_df[['age','sex']]

We can use this for lots of great stuff like aggregating values in a column.

# What is the maximum charges
insurance_df['charges'].max()

Filtering#

Filtering is a bit more complex. What we need is a boolean array of values to index our dataframe by. I’ll leave the explanation to the experts, but show a few examples.

# All the smokers
insurance_df[insurance_df['smoker']=='yes']

# Or using the query syntax
insurance_df.query('smoker=="yes"')

# and now, just the male smokers
insurance_df.query('sex=="male" & smoker=="yes"')

# We can also group our queries and then apply aggregates
# In this example, we are grouping by 'sex' and then using only the 'bmi' series, getting the mean()
insurance_df.groupby('sex')['bmi'].mean()

# Or by two categories
# Here we group by two columns and still just looking at bmi
insurance_df.groupby(['sex','smoker'])['bmi'].mean()

Working on data in a series#

The library is incredibly powerful, but just a few things we want to do. Let’s take a few examples.

A few things to note

all operations apply to all the items in the series by default, so no need to loop
while we can change values in a series, more commonly we’ll just create a new series and replace the old one (in one fellswoop)
notice that what we are really doing is creating a new series anyway (below, we are rounding the numbers, but this creates a new series - unconnected to our dataframe)

# Create a new column that just says whether someone has children
# If the number of children is > 0 then True else False
insurance_df['has_children'] = insurance_df['children'] > 0
insurance_df

# Now, let's round the amount of the charges to 2 decimals
insurance_df['charges'] = round(insurance_df['charges'],2)
insurance_df['charges']

# assuming these charges are an annual rate, let's create a column of the monthly rate
insurance_df['monthly_charges']=round(insurance_df['charges']/12,2)
insurance_df

Working with strings in the series#

Working with strings in a series are a little different, but they look alot like the python string functions. We need to tell pandas what the type of data we want to work with is

# Working with a string, ensure all the male and female strings are UPPERCASED
insurance_df['gender']=insurance_df['sex'].str.upper()
insurance_df

# or we could uppercase just the first letter of the region for instance
insurance_df['region'] = insurance_df['region'].str.title()
insurance_df

Python 4 Analytics

Working with data

Contents