Pandas Functions

So far, we have looked at the basics of the Python programming language. As we move on to using Python for data analysis, let's study some common functions in the pandas library for data manipulation (or data cleaning). It is important to note that this list of methods is not exhaustive; these are simply some of the most commonly used pandas functions.

# This code just loads up the data we need for the examples below
import sys
sys.path.append('..')
from src.data import load_data
  1. .drop(): A dataset can contain columns with missing values, or columns that are simply not useful for your analysis. For example, you might have a New York City taxi dataset containing trip id, datetime, trip distance, payment type, and passenger count, but want to focus on analyzing trip trends in a day. In that case, payment type and passenger count are not relevant, and keeping them in the dataset takes up unnecessary space and processing time.

When working with pandas we can solve this by using the drop() method.

#index_col='LocationID' uses the LocationID column in the data as the dataframe index
import pandas as pd
zone=load_data('taxi_zone_lookup',index_col='LocationID')

#Here, we are dropping all the columns that are not needed for our analysis
#axis=1 means we are dropping a column (axis=0 would drop rows)
#inplace=True modifies the existing dataframe instead of returning a new one
zone.drop(['service_zone'], axis=1, inplace=True)
zone.head()
Borough Zone
LocationID
1 EWR Newark Airport
2 Queens Jamaica Bay
3 Bronx Allerton/Pelham Gardens
4 Manhattan Alphabet City
5 Staten Island Arden Heights
  2. .fillna(): Sometimes, instead of dropping rows with missing values, we might want to replace the missing (NaN) cells with some value. To achieve this we use the .fillna() method. replace() and interpolate() can also be used to replace missing values.
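As a minimal sketch of the difference between these methods, on a small made-up DataFrame (hypothetical values, not the taxi data):

```python
import numpy as np
import pandas as pd

# A small, hypothetical DataFrame with missing values
df = pd.DataFrame({'trip_distance': [1.2, np.nan, 3.5, np.nan],
                   'passenger_count': [1.0, 2.0, np.nan, 4.0]})

# .fillna() replaces every missing cell with a fixed value
filled = df.fillna(0)

# .interpolate() instead estimates a missing value from its neighbours
interpolated = df['trip_distance'].interpolate()  # 1.2, 2.35, 3.5, ...
```

Here fillna(0) makes sense for counts, while interpolate() suits ordered numeric data such as distances over time.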

  3. .loc[]: This is an indexing method used to select particular rows or columns by their label names. .loc[] accepts strings, integers, or lists. Another indexing method is .iloc[], which selects columns or rows using purely integer-location based indexing, i.e. selection by position. For example, let's select the Borough column with some rows from the New York City taxi zone data using .loc[].

#displays the column "Borough", but only with the rows 3,6,9,15
zone.loc[[3,6,9,15],['Borough']]
Borough
LocationID
3 Bronx
6 Staten Island
9 Queens
15 Queens
# display all the data for rows at positions 3 through 9 (a plain slice uses positions, not labels)
zone[3:10]
Borough Zone
LocationID
4 Manhattan Alphabet City
5 Staten Island Arden Heights
6 Staten Island Arrochar/Fort Wadsworth
7 Queens Astoria
8 Queens Astoria Park
9 Queens Auburndale
10 Queens Baisley Park
# finally just display the Borough column
zone['Borough']
LocationID
1                EWR
2             Queens
3              Bronx
4          Manhattan
5      Staten Island
           ...      
261        Manhattan
262        Manhattan
263        Manhattan
264          Unknown
265          Unknown
Name: Borough, Length: 265, dtype: object
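To make the label-versus-position distinction between .loc[] and .iloc[] concrete, here is a minimal sketch on a tiny stand-in for the taxi zone table (hypothetical rows, not loaded from the real file):

```python
import pandas as pd

# A tiny stand-in for the taxi zone lookup table
zone_demo = pd.DataFrame(
    {'Borough': ['EWR', 'Queens', 'Bronx', 'Manhattan'],
     'Zone': ['Newark Airport', 'Jamaica Bay',
              'Allerton/Pelham Gardens', 'Alphabet City']},
    index=pd.Index([1, 2, 3, 4], name='LocationID'))

# .loc selects by index *label*: label 1 is the first row here
by_label = zone_demo.loc[1, 'Borough']      # 'EWR'

# .iloc selects by integer *position*: position 1 is the second row
by_position = zone_demo.iloc[1, 0]          # 'Queens'
```

Because this index starts at 1, labels and positions are off by one, which is exactly the kind of confusion that choosing the right accessor avoids.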
  4. .rename(): Sometimes column names are written in a way that is not easy to understand, and we need to rename them. The .rename() method takes a couple of parameters, one of which is a dictionary-like object mapping old names to new names.

# Rename the Zone column to Taxi Zone
zone.rename(columns={'Zone':'Taxi Zone'}, inplace=True)
zone.head()
Borough Taxi Zone
LocationID
1 EWR Newark Airport
2 Queens Jamaica Bay
3 Bronx Allerton/Pelham Gardens
4 Manhattan Alphabet City
5 Staten Island Arden Heights
  5. .columns: While not technically a pandas function, this property is helpful when working with a large dataset that has many columns and you want to see all of them. .columns displays every column in the dataset.

books_df = load_data('CharlesBookClub')
books_df.head()
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1

5 rows × 24 columns

Notice how there is a ... between CookBks and ItalCook? This shows that not all 24 columns can be displayed in the notebook at once, which could lead us to believe there are only 20 columns. Using the .columns property gives us a pandas Index object holding every column label.

#this prints all the columns in the book club dataset
print(books_df.columns)
Index(['Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
       'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
       'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase',
       'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence'],
      dtype='object')
  6. len(): (Not strictly a pandas function.) Provides the length (number of rows) of the dataframe. For example:

len(books_df)
4000
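Closely related, the .shape attribute reports rows and columns at once; a minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows = len(df)    # 3 rows
dims = df.shape     # (3, 2): (rows, columns)
```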
  7. .dtypes: It is useful to know the data types of our variables before analysis. This property shows what kind of data is expected in each column.

# Print the datatypes of the columns in the books dataframe
print(books_df.dtypes)
Seq#                int64
ID#                 int64
Gender              int64
M                   int64
R                   int64
F                   int64
FirstPurch          int64
ChildBks            int64
YouthBks            int64
CookBks             int64
DoItYBks            int64
RefBks              int64
ArtBks              int64
GeogBks             int64
ItalCook            int64
ItalAtlas           int64
ItalArt             int64
Florence            int64
Related Purchase    int64
Mcode               int64
Rcode               int64
Fcode               int64
Yes_Florence        int64
No_Florence         int64
dtype: object
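When .dtypes reveals a column with the wrong type (for example, numbers read in as text), the .astype() method converts it; a minimal sketch with hypothetical values:

```python
import pandas as pd

# 'M' holds numbers but is stored as text
df = pd.DataFrame({'Gender': [1, 0, 1], 'M': ['297', '128', '138']})

# dtypes shows 'M' came in as object (text), not a number
print(df.dtypes)

# .astype() converts the column so numeric operations work
df['M'] = df['M'].astype(int)
```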

Again, this list of pandas functions useful for data manipulation is not exhaustive. You can find many more in the pandas documentation.