Exploratory Data Analysis#
Telling the future is the exciting part of data mining. Everyone wants to build the model that most accurately predicts future results, or that best explains why the past occurred the way it did. But in order to build a valuable model, it is important to understand the data: determine its shape, and understand the predictors and the target variables (if any). Understanding the scale of these values and their relationships to each other can save hours of testing different modeling techniques and tuning parameters.
The next cell is one that will appear, in some configuration, as the first cell in nearly every notebook. It imports the key libraries we are going to use in our analysis and model building. Here, we will depend on pandas and numpy for our data manipulation, and we’ll leverage matplotlib as our graphical library. We’ll also use the seaborn library to show off a few plots and visuals that are not quite as readily accessible with matplotlib.
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('..')
from src.data import load_data
pd.set_option('display.precision', 4)
# If you are using a 'light' theme,
# comment out or remove the next line
plt.style.use('dark_background')
Boston Housing Dataset#
Let’s take a look at some basic data manipulation with pandas and understand how to get some data to work with. In all of our examples, we’ll use a pretty standard text format called CSV (comma-separated values). This format is readable by nearly every statistical software package, and by humans. The first row typically holds the column names, and each subsequent line of the file is a row of data with the values separated by commas. The pandas library supports many different ways to load up a dataframe, which we will use as the primary mechanism for manipulating data in these notebooks.
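Since load_data is just a convenience helper bundled with this book, it is worth knowing the plain-pandas equivalent. A minimal sketch, assuming the CSV lives at an illustrative local path:

import pandas as pd

# read_csv treats the first row as column names by default
housing_df = pd.read_csv('data/BostonHousing.csv')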
Business Context#
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s
CAT.MEDV: Is median value of owner-occupied home in tract above $30k (CAT.MEDV = 1) or not (CAT.MEDV = 0)
We can see that the input attributes have a mixture of units.
# Load up the housing dataset
housing_df = load_data('BostonHousing')
housing_df
/workspaces/py4analytics/book/data/BostonHousing.csv
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | CAT. MEDV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0063 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 | 0 |
1 | 0.0273 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 | 0 |
2 | 0.0273 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 1 |
3 | 0.0324 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 | 1 |
4 | 0.0691 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.0626 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 9.67 | 22.4 | 0 |
502 | 0.0453 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 9.08 | 20.6 | 0 |
503 | 0.0608 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 5.64 | 23.9 | 0 |
504 | 0.1096 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 6.48 | 22.0 | 0 |
505 | 0.0474 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273 | 21.0 | 7.88 | 11.9 | 0 |
506 rows × 14 columns
Exploratory Data Analysis (EDA)#
Once we understand the business context, we want to take a look at our data and see what else we can discover about the relationships between our target variable and the other independent factors.
We’ll end up doing quite a bit of EDA. Usually we’ll start with some EDA, clean up the data and munge it into an appropriate format for modeling, and then check out the results. It pays to have a good grasp of a plotting library and some techniques to make this process go a bit faster. There are a few libraries which are pretty commonplace in the Python data science world, including seaborn and matplotlib. Many new libraries have been introduced recently as well that add much more interactivity with less coding.
# Rename the column to something more convenient (notice the space in 'CAT. MEDV')
housing_df.rename(columns={'CAT. MEDV':'CAT_MEDV'},inplace=True)
# Take a look at the first few rows of data
housing_df.head()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | CAT_MEDV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0063 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 4.98 | 24.0 | 0 |
1 | 0.0273 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 | 21.6 | 0 |
2 | 0.0273 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 1 |
3 | 0.0324 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 | 33.4 | 1 |
4 | 0.0691 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 | 36.2 | 1 |
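As a quick sanity check, we can verify that CAT_MEDV matches its stated definition, a median value above $30k. A minimal sketch (this check is illustrative, not part of the original notebook):

# CAT_MEDV should be 1 exactly when MEDV (in $1000s) exceeds 30
(housing_df['CAT_MEDV'] == (housing_df['MEDV'] > 30).astype(int)).all()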
# check the rows and columns
housing_df.shape
(506, 14)
We see from the output above that we have 506 rows and 14 columns, but we can’t see all the columns. Let’s check out the column names and get an idea of some descriptive statistics for each numerical column.
housing_df.describe()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | CAT_MEDV
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 | 506.0000 |
mean | 3.6135 | 11.3636 | 11.1368 | 0.0692 | 0.5547 | 6.2846 | 68.5749 | 3.7950 | 9.5494 | 408.2372 | 18.4555 | 12.6531 | 22.5328 | 0.1660 |
std | 8.6015 | 23.3225 | 6.8604 | 0.2540 | 0.1159 | 0.7026 | 28.1489 | 2.1057 | 8.7073 | 168.5371 | 2.1649 | 7.1411 | 9.1971 | 0.3725 |
min | 0.0063 | 0.0000 | 0.4600 | 0.0000 | 0.3850 | 3.5610 | 2.9000 | 1.1296 | 1.0000 | 187.0000 | 12.6000 | 1.7300 | 5.0000 | 0.0000 |
25% | 0.0820 | 0.0000 | 5.1900 | 0.0000 | 0.4490 | 5.8855 | 45.0250 | 2.1002 | 4.0000 | 279.0000 | 17.4000 | 6.9500 | 17.0250 | 0.0000 |
50% | 0.2565 | 0.0000 | 9.6900 | 0.0000 | 0.5380 | 6.2085 | 77.5000 | 3.2074 | 5.0000 | 330.0000 | 19.0500 | 11.3600 | 21.2000 | 0.0000 |
75% | 3.6771 | 12.5000 | 18.1000 | 0.0000 | 0.6240 | 6.6235 | 94.0750 | 5.1884 | 24.0000 | 666.0000 | 20.2000 | 16.9550 | 25.0000 | 0.0000 |
max | 88.9762 | 100.0000 | 27.7400 | 1.0000 | 0.8710 | 8.7800 | 100.0000 | 12.1265 | 24.0000 | 711.0000 | 22.0000 | 37.9700 | 50.0000 | 1.0000 |
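If pandas truncates wide output like this on your screen, you can raise the display limit. A small aside, assuming default pandas display options:

# Show all columns instead of truncating wide dataframes
pd.set_option('display.max_columns', None)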
housing_df.dtypes
CRIM float64
ZN float64
INDUS float64
CHAS int64
NOX float64
RM float64
AGE float64
DIS float64
RAD int64
TAX int64
PTRATIO float64
LSTAT float64
MEDV float64
CAT_MEDV int64
dtype: object
Now that we have an idea of the numerical fields, we should check out the distribution of the CAT_MEDV field to see how its values are laid out.
print(housing_df.value_counts(['CAT_MEDV']))
# ... and also as percentages
housing_df.value_counts(['CAT_MEDV']) / len(housing_df)
CAT_MEDV
0 422
1 84
dtype: int64
CAT_MEDV
0 0.834
1 0.166
dtype: float64
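Rather than dividing by len(housing_df) ourselves, value_counts can return proportions directly. A minimal alternative using pandas’ built-in normalize flag:

# normalize=True returns relative frequencies instead of raw counts
housing_df.value_counts(['CAT_MEDV'], normalize=True)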
We can now take a look at a couple of variables as they relate to our target variables (MEDV and CAT_MEDV).
housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False)
<AxesSubplot:xlabel='LSTAT', ylabel='MEDV'>
[figure: scatter plot of LSTAT vs. MEDV]
ax = housing_df.groupby('CHAS').mean().MEDV.plot(kind='bar')
ax.set_ylabel('Avg. MEDV')
Text(0, 0.5, 'Avg. MEDV')
[figure: bar chart of average MEDV by CHAS]
dataForPlot = housing_df.groupby('CHAS').mean()['CAT_MEDV'] * 100
ax = dataForPlot.plot(kind='bar', figsize=[5,3])
ax.set_ylabel('% of CAT.MEDV')
Text(0, 0.5, '% of CAT.MEDV')
[figure: bar chart of % CAT_MEDV by CHAS]
Determining the interaction between a set of variables#
Sometimes it is helpful to see many more dimensions of the data at once. We can use color, size, shape, and axes to show several dimensions, and a commonly overlooked approach is to use faceting as yet another dimension. Let’s take a look at how the age of the houses varies by proximity to the Charles River.
From here we can see, as before, that there are many more homes not on the Charles River, and also that the distribution is skewed heavily toward older homes.
(Check out the extra notebook Visualization Samples to see another library and possible visualizations.)
g = sns.FacetGrid(housing_df, col='CHAS')
g.map_dataframe(sns.histplot,x='AGE')
<seaborn.axisgrid.FacetGrid at 0x7f8980423ca0>
[figure: faceted histograms of AGE by CHAS]
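As mentioned above, color and size can also encode extra dimensions in a single chart. A quick sketch using seaborn’s hue and size parameters (this particular combination of columns is just for illustration):

# Four dimensions at once: x, y, color (target class) and point size
sns.scatterplot(data=housing_df, x='LSTAT', y='MEDV', hue='CAT_MEDV', size='RM')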
We may also be interested in the relationships among a set of the variables so that we can identify which ones may prove to be over-influencing a regression model. For this we can use two approaches: a set of pair-wise charts, and a correlation map. First, the pairwise graph, which shows the relationship of five variables to each other in one simple, clean chart.
# Set up a small subset of the data
df = housing_df[['CRIM','INDUS','LSTAT','RM','MEDV']]
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f8980423d00>
[figure: pair plot of CRIM, INDUS, LSTAT, RM and MEDV]
Looking at the pair plots, it seems that we have a very clear relationship between RM and MEDV, an inverse linear relationship between LSTAT and RM, and a slight, possibly non-linear, relationship between LSTAT and MEDV.
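To probe that possibly non-linear LSTAT/MEDV relationship, one option is to overlay a polynomial fit on the scatter. A sketch, assuming a quadratic is a reasonable first guess:

# order=2 fits a second-degree polynomial through the points
sns.regplot(data=housing_df, x='LSTAT', y='MEDV', order=2)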
Let’s take a look and see if we can find any correlations.
print(housing_df.corr())
sns.heatmap(df.corr(), fmt='.2f',
            annot=True,
            annot_kws={'size': 15},
            cmap='YlGnBu')
CRIM ZN INDUS CHAS NOX RM AGE DIS \
CRIM 1.0000 -0.2005 0.4066 -0.0559 0.4210 -0.2192 0.3527 -0.3797
ZN -0.2005 1.0000 -0.5338 -0.0427 -0.5166 0.3120 -0.5695 0.6644
INDUS 0.4066 -0.5338 1.0000 0.0629 0.7637 -0.3917 0.6448 -0.7080
CHAS -0.0559 -0.0427 0.0629 1.0000 0.0912 0.0913 0.0865 -0.0992
NOX 0.4210 -0.5166 0.7637 0.0912 1.0000 -0.3022 0.7315 -0.7692
RM -0.2192 0.3120 -0.3917 0.0913 -0.3022 1.0000 -0.2403 0.2052
AGE 0.3527 -0.5695 0.6448 0.0865 0.7315 -0.2403 1.0000 -0.7479
DIS -0.3797 0.6644 -0.7080 -0.0992 -0.7692 0.2052 -0.7479 1.0000
RAD 0.6255 -0.3119 0.5951 -0.0074 0.6114 -0.2098 0.4560 -0.4946
TAX 0.5828 -0.3146 0.7208 -0.0356 0.6680 -0.2920 0.5065 -0.5344
PTRATIO 0.2899 -0.3917 0.3832 -0.1215 0.1889 -0.3555 0.2615 -0.2325
LSTAT 0.4556 -0.4130 0.6038 -0.0539 0.5909 -0.6138 0.6023 -0.4970
MEDV -0.3883 0.3604 -0.4837 0.1753 -0.4273 0.6954 -0.3770 0.2499
CAT_MEDV -0.1520 0.3653 -0.3663 0.1086 -0.2325 0.6413 -0.1912 0.1189
RAD TAX PTRATIO LSTAT MEDV CAT_MEDV
CRIM 0.6255 0.5828 0.2899 0.4556 -0.3883 -0.1520
ZN -0.3119 -0.3146 -0.3917 -0.4130 0.3604 0.3653
INDUS 0.5951 0.7208 0.3832 0.6038 -0.4837 -0.3663
CHAS -0.0074 -0.0356 -0.1215 -0.0539 0.1753 0.1086
NOX 0.6114 0.6680 0.1889 0.5909 -0.4273 -0.2325
RM -0.2098 -0.2920 -0.3555 -0.6138 0.6954 0.6413
AGE 0.4560 0.5065 0.2615 0.6023 -0.3770 -0.1912
DIS -0.4946 -0.5344 -0.2325 -0.4970 0.2499 0.1189
RAD 1.0000 0.9102 0.4647 0.4887 -0.3816 -0.1979
TAX 0.9102 1.0000 0.4609 0.5440 -0.4685 -0.2737
PTRATIO 0.4647 0.4609 1.0000 0.3740 -0.5078 -0.4434
LSTAT 0.4887 0.5440 0.3740 1.0000 -0.7377 -0.4699
MEDV -0.3816 -0.4685 -0.5078 -0.7377 1.0000 0.7898
CAT_MEDV -0.1979 -0.2737 -0.4434 -0.4699 0.7898 1.0000
<AxesSubplot:>
[figure: annotated correlation heatmap]
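Since a correlation matrix is symmetric, half of the heatmap above is redundant. A sketch of hiding the upper triangle with a mask (a common tidy-up, not part of the original notebook):

# Build a boolean mask covering the upper triangle (and diagonal)
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask=mask, annot=True, cmap='YlGnBu')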
correlationMatrix = df.corr()
# Convert the correlation matrix to a one-dimensional Series indexed by variable pairs
correlationMatrix = correlationMatrix.unstack()
correlationMatrix[abs(correlationMatrix) > 0.7]
CRIM CRIM 1.0000
INDUS INDUS 1.0000
LSTAT LSTAT 1.0000
MEDV -0.7377
RM RM 1.0000
MEDV LSTAT -0.7377
MEDV 1.0000
dtype: float64
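The 1.0000 entries above are just each variable paired with itself. A small sketch of filtering those trivial self-pairs out (illustrative, not from the original notebook):

# Keep the strong correlations, then drop variable-with-itself pairs
strong = correlationMatrix[abs(correlationMatrix) > 0.7]
strong[strong.index.get_level_values(0) != strong.index.get_level_values(1)]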
Other datasets that are of interest for EDA#
Amtrak - includes data for time series analysis
Bankruptcy - lots of factors to consider
Motor Vehicle Collisions - includes lat/long, useful for mapping visuals
Other tools for EDA#
You may find the notebook Other Tools for EDA an interesting tour as well.