Exploratory Data Analysis#

Telling the future is the exciting part of data mining. Everyone wants to build the model that most accurately predicts future results or best explains why the past occurred the way it did. But in order to build a valuable model, it is important to understand the data: determine its shape, and understand the predictors and the target variables (if any). Understanding the scale of these values and their relationships to each other can save hours of testing different modeling techniques and tuning parameters.

The next cell is one that will appear, in some configuration, as the first in nearly every notebook. It imports the key libraries we are going to use in our analysis and model building. Here we depend on pandas and numpy for our data manipulation, and we’ll leverage matplotlib as our graphical library. We’ll also use the seaborn library to show off a few plots and visuals that are not quite as readily accessible with matplotlib alone.

# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
import sys
sys.path.append('..')
from src.data import load_data

pd.set_option('display.precision', 4)
# If you are using a 'light' theme,
# comment out or remove this line
plt.style.use('dark_background')

Boston Housing Dataset#

Let’s take a look at some basic data manipulation with pandas and understand how to get some data to work with. In all of our examples, we’ll use a pretty standard text format called CSV (comma-separated values), which is readable by nearly every statistical software package and by humans. The first row typically holds the column names, and each subsequent line of the file is a row of data with the values separated by commas. The pandas library supports many different ways to load data into a dataframe, which we will use as the primary mechanism for manipulating data in these notebooks.
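For reference, reading a CSV with pandas directly is a one-liner; a minimal sketch (the file path here is illustrative; in these notebooks the load_data helper resolves the real location for us):

# Read a CSV file into a dataframe; the first row supplies the column names
# (hypothetical path; load_data wraps this step in these notebooks)
housing = pd.read_csv('data/BostonHousing.csv')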

Business Context#

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

  • CRIM: per capita crime rate by town

  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS: proportion of non-retail business acres per town

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX: nitric oxides concentration (parts per 10 million)

  • RM: average number of rooms per dwelling

  • AGE: proportion of owner-occupied units built prior to 1940

  • DIS: weighted distances to five Boston employment centers

  • RAD: index of accessibility to radial highways

  • TAX: full-value property-tax rate per $10,000

  • PTRATIO: pupil-teacher ratio by town

  • B: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town

  • LSTAT: % lower status of the population

  • MEDV: Median value of owner-occupied homes in $1000s

  • CAT.MEDV: Is median value of owner-occupied home in tract above $30k (CAT.MEDV = 1) or not (CAT.MEDV = 0)

We can see that the input attributes have a mixture of units.

# Load up the housing dataset
housing_df = load_data('BostonHousing')
housing_df
/workspaces/py4analytics/book/data/BostonHousing.csv
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT. MEDV
0 0.0063 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0 0
1 0.0273 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 0
2 0.0273 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 1
3 0.0324 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 1
4 0.0691 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.0626 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 9.67 22.4 0
502 0.0453 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 9.08 20.6 0
503 0.0608 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 5.64 23.9 0
504 0.1096 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 6.48 22.0 0
505 0.0474 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 7.88 11.9 0

506 rows × 14 columns

Exploratory Data Analysis (EDA)#

Once we understand the business context, we want to take a look at our data and see what else we can discover about the relationships between our target variable and the other independent factors.

We’ll end up doing quite a bit of EDA. Usually we’ll start with some EDA, clean up the data and munge it into an appropriate format for modeling, and then we’ll want to check out the results. It pays to have a good grasp of a plotting library and some techniques to make this process go a bit faster. A few libraries are commonplace in the Python data science world, including seaborn and matplotlib. Many newer libraries add much more interactive opportunities with less coding.

# Rename the column to something more convenient (notice the space in 'CAT. MEDV')
housing_df.rename(columns={'CAT. MEDV': 'CAT_MEDV'}, inplace=True)
# Take a look at the first few rows of data
housing_df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT_MEDV
0 0.0063 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0 0
1 0.0273 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 0
2 0.0273 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 1
3 0.0324 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 1
4 0.0691 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 1
# check the rows and columns
housing_df.shape
(506, 14)

We see from the output above that we have 506 rows and 14 columns, but we can’t see all the columns. Let’s check out the column names and get an idea of some descriptive statistics for each numerical column.
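The column names alone are available from the dataframe’s columns attribute; a quick sketch (output not shown):

# List just the column names
list(housing_df.columns)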

housing_df.describe()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT_MEDV
count 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000 506.0000
mean 3.6135 11.3636 11.1368 0.0692 0.5547 6.2846 68.5749 3.7950 9.5494 408.2372 18.4555 12.6531 22.5328 0.1660
std 8.6015 23.3225 6.8604 0.2540 0.1159 0.7026 28.1489 2.1057 8.7073 168.5371 2.1649 7.1411 9.1971 0.3725
min 0.0063 0.0000 0.4600 0.0000 0.3850 3.5610 2.9000 1.1296 1.0000 187.0000 12.6000 1.7300 5.0000 0.0000
25% 0.0820 0.0000 5.1900 0.0000 0.4490 5.8855 45.0250 2.1002 4.0000 279.0000 17.4000 6.9500 17.0250 0.0000
50% 0.2565 0.0000 9.6900 0.0000 0.5380 6.2085 77.5000 3.2074 5.0000 330.0000 19.0500 11.3600 21.2000 0.0000
75% 3.6771 12.5000 18.1000 0.0000 0.6240 6.6235 94.0750 5.1884 24.0000 666.0000 20.2000 16.9550 25.0000 0.0000
max 88.9762 100.0000 27.7400 1.0000 0.8710 8.7800 100.0000 12.1265 24.0000 711.0000 22.0000 37.9700 50.0000 1.0000
housing_df.dtypes
CRIM        float64
ZN          float64
INDUS       float64
CHAS          int64
NOX         float64
RM          float64
AGE         float64
DIS         float64
RAD           int64
TAX           int64
PTRATIO     float64
LSTAT       float64
MEDV        float64
CAT_MEDV      int64
dtype: object
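As an aside, with a wider table we could pick out the numeric fields programmatically rather than by eye; a minimal sketch (in this dataset all 14 columns qualify, including the integer-coded dummies CHAS and CAT_MEDV):

# Select only the numeric columns
numeric_cols = housing_df.select_dtypes(include='number').columns
print(list(numeric_cols))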

Now that we have an idea of the numerical fields, we should check out the distribution of the CAT_MEDV field to see how these values are laid out.

print(housing_df.value_counts(['CAT_MEDV']))
# and also the percentages
housing_df.value_counts(['CAT_MEDV']) / len(housing_df)
CAT_MEDV
0           422
1            84
dtype: int64
CAT_MEDV
0           0.834
1           0.166
dtype: float64
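As a side note, the manual division above isn’t strictly necessary; value_counts can return proportions directly via its normalize flag:

# normalize=True returns proportions instead of raw counts
housing_df['CAT_MEDV'].value_counts(normalize=True)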

We can now take a look at a couple of values as they relate to our target variable (CAT_MEDV).

housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False)
<AxesSubplot:xlabel='LSTAT', ylabel='MEDV'>
[Scatter plot of MEDV versus LSTAT]
ax = housing_df.groupby('CHAS').mean().MEDV.plot(kind='bar')
ax.set_ylabel('Avg. MEDV')
Text(0, 0.5, 'Avg. MEDV')
[Bar chart of average MEDV by CHAS]
dataForPlot = housing_df.groupby('CHAS').mean()['CAT_MEDV'] * 100
ax = dataForPlot.plot(kind='bar', figsize=[5, 3])
ax.set_ylabel('% of CAT.MEDV')
Text(0, 0.5, '% of CAT.MEDV')
[Bar chart of the percentage of CAT_MEDV by CHAS]

Determining the interaction between a set of variables#

Sometimes it is helpful to see many more dimensions of the data at once. We can use color, size, shape, and axes to show several dimensions, and one commonly overlooked approach is to use faceting as yet another dimension. Let’s take a look at how the age of the houses varies by proximity to the Charles River.

From here we can see that, as before, there are many more homes not on the Charles River, and also that the distribution of home ages is skewed heavily toward older homes.

(Check out the extra notebook Visualization Samples to see another library and possible visualizations.)

g = sns.FacetGrid(housing_df, col='CHAS')
g.map_dataframe(sns.histplot,x='AGE')
<seaborn.axisgrid.FacetGrid at 0x7f8980423ca0>
[Histograms of AGE, faceted by CHAS]

We may also be interested in the relationships among a set of variables so that we can identify which ones may prove to be over-influencing a regression model. For this we can use two approaches: a set of related charts in a pairwise plot, or a correlation map. First, the pairwise plot, which shows the relationship between each pair of a handful of variables in one simple, clean chart.

# setup a small subset of data
df = housing_df[['CRIM', 'INDUS', 'LSTAT', 'RM', 'MEDV']]
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f8980423d00>
[Pairwise scatter plots of CRIM, INDUS, LSTAT, RM and MEDV]

Looking at the pair plots, it seems that we have a very clear relationship between RM and MEDV, an inverse linear relationship between LSTAT and RM, and also a slight, possibly non-linear, relationship between LSTAT and MEDV.
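One quick way to check the suspected curvature in the LSTAT/MEDV relationship is to overlay a polynomial fit on the scatter plot; a minimal sketch (the choice of order=2 is just an assumption for illustration):

# Overlay a second-order polynomial fit to highlight the possible curvature
sns.regplot(data=housing_df, x='LSTAT', y='MEDV', order=2,
            scatter_kws={'alpha': 0.4})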

Let’s take a look and see if we can find any correlations.

print(housing_df.corr())
sns.heatmap(df.corr(),  fmt='.2f',
    annot=True,
    annot_kws={'size': 15},
    cmap='YlGnBu')
            CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS  \
CRIM      1.0000 -0.2005  0.4066 -0.0559  0.4210 -0.2192  0.3527 -0.3797   
ZN       -0.2005  1.0000 -0.5338 -0.0427 -0.5166  0.3120 -0.5695  0.6644   
INDUS     0.4066 -0.5338  1.0000  0.0629  0.7637 -0.3917  0.6448 -0.7080   
CHAS     -0.0559 -0.0427  0.0629  1.0000  0.0912  0.0913  0.0865 -0.0992   
NOX       0.4210 -0.5166  0.7637  0.0912  1.0000 -0.3022  0.7315 -0.7692   
RM       -0.2192  0.3120 -0.3917  0.0913 -0.3022  1.0000 -0.2403  0.2052   
AGE       0.3527 -0.5695  0.6448  0.0865  0.7315 -0.2403  1.0000 -0.7479   
DIS      -0.3797  0.6644 -0.7080 -0.0992 -0.7692  0.2052 -0.7479  1.0000   
RAD       0.6255 -0.3119  0.5951 -0.0074  0.6114 -0.2098  0.4560 -0.4946   
TAX       0.5828 -0.3146  0.7208 -0.0356  0.6680 -0.2920  0.5065 -0.5344   
PTRATIO   0.2899 -0.3917  0.3832 -0.1215  0.1889 -0.3555  0.2615 -0.2325   
LSTAT     0.4556 -0.4130  0.6038 -0.0539  0.5909 -0.6138  0.6023 -0.4970   
MEDV     -0.3883  0.3604 -0.4837  0.1753 -0.4273  0.6954 -0.3770  0.2499   
CAT_MEDV -0.1520  0.3653 -0.3663  0.1086 -0.2325  0.6413 -0.1912  0.1189   

             RAD     TAX  PTRATIO   LSTAT    MEDV  CAT_MEDV  
CRIM      0.6255  0.5828   0.2899  0.4556 -0.3883   -0.1520  
ZN       -0.3119 -0.3146  -0.3917 -0.4130  0.3604    0.3653  
INDUS     0.5951  0.7208   0.3832  0.6038 -0.4837   -0.3663  
CHAS     -0.0074 -0.0356  -0.1215 -0.0539  0.1753    0.1086  
NOX       0.6114  0.6680   0.1889  0.5909 -0.4273   -0.2325  
RM       -0.2098 -0.2920  -0.3555 -0.6138  0.6954    0.6413  
AGE       0.4560  0.5065   0.2615  0.6023 -0.3770   -0.1912  
DIS      -0.4946 -0.5344  -0.2325 -0.4970  0.2499    0.1189  
RAD       1.0000  0.9102   0.4647  0.4887 -0.3816   -0.1979  
TAX       0.9102  1.0000   0.4609  0.5440 -0.4685   -0.2737  
PTRATIO   0.4647  0.4609   1.0000  0.3740 -0.5078   -0.4434  
LSTAT     0.4887  0.5440   0.3740  1.0000 -0.7377   -0.4699  
MEDV     -0.3816 -0.4685  -0.5078 -0.7377  1.0000    0.7898  
CAT_MEDV -0.1979 -0.2737  -0.4434 -0.4699  0.7898    1.0000  
<AxesSubplot:>
[Heatmap of the correlations among CRIM, INDUS, LSTAT, RM and MEDV]
correlationMatrix = df.corr()
# Unstack the correlation matrix into a one-dimensional Series
correlationMatrix = correlationMatrix.unstack()
correlationMatrix[abs(correlationMatrix) > 0.7]
CRIM   CRIM     1.0000
INDUS  INDUS    1.0000
LSTAT  LSTAT    1.0000
       MEDV    -0.7377
RM     RM       1.0000
MEDV   LSTAT   -0.7377
       MEDV     1.0000
dtype: float64
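The listing above still reports the trivial self-correlations (the 1.0000 entries) and each remaining pair twice, once in each order. One way to clean that up, shown here as a small sketch, is to mask out the diagonal and lower triangle before unstacking:

corr = df.corr()
# Keep only the upper triangle; k=1 excludes the diagonal of self-correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack()
# Each strongly correlated pair now appears exactly once
pairs[pairs.abs() > 0.7]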

Other datasets of interest for EDA#

Other tools for EDA#

You may find the notebook Other Tools for EDA an interesting tour as well.