Pandas Functions

So far, we have looked at the basics of the Python programming language. As we move on to using Python for data analysis, let's study some common functions in the pandas library for data manipulation (or data cleaning). It is important to note that this list of methods is not exhaustive; these are simply some of the most commonly used pandas functions.

# This code just loads up the data we need for the examples below
import sys
sys.path.append('..')
from src.data import load_data
  1. .drop(): A dataset can contain columns with missing values, or columns that are simply not useful for your analysis. For example, you might have a New York City taxi dataset containing trip id, datetime, trip distance, payment type, and passenger count, but want to focus on analyzing trip trends in a day. In that case, payment type and passenger count are not relevant, and keeping them in the dataset takes up unnecessary space and processing time.

When working with pandas we can solve this by using the drop() method.

#index_col='LocationID' uses the LocationID column in the data as the dataframe index
import pandas as pd
zone=load_data('taxi_zone_lookup',index_col='LocationID')

#Here, we are dropping all the columns that are not needed for our analysis
#axis=1 means we are dropping a column (axis=0 would drop rows)
#inplace=True modifies the existing dataframe instead of returning a new one
zone.drop(['service_zone'], axis=1, inplace=True)
zone.head()
Borough Zone
LocationID
1 EWR Newark Airport
2 Queens Jamaica Bay
3 Bronx Allerton/Pelham Gardens
4 Manhattan Alphabet City
5 Staten Island Arden Heights
  2. .fillna(): Sometimes, instead of dropping rows with missing values, we might want to replace the missing (NaN) cells with some value. To achieve this we use the .fillna() method. replace() and interpolate() can also be used to replace missing values.
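As a minimal sketch of the difference between these methods, on a small made-up DataFrame (hypothetical values, not the taxi data):

```python
import numpy as np
import pandas as pd

# A small, hypothetical DataFrame with missing values
df = pd.DataFrame({'trip_distance': [1.2, np.nan, 3.5, np.nan],
                   'passenger_count': [1.0, 2.0, np.nan, 4.0]})

# .fillna() replaces every missing cell with a fixed value
filled = df.fillna(0)

# .interpolate() instead estimates a missing value from its neighbours
interpolated = df['trip_distance'].interpolate()  # 1.2, 2.35, 3.5, ...
```

Here fillna(0) makes sense for counts, while interpolate() suits ordered numeric data such as distances over time.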

  3. .loc[]: This is an indexing method used to select particular rows or columns by their label names. .loc[] accepts strings, integers, or lists. Another indexing method is .iloc[], which selects columns or rows using purely integer-location based indexing, i.e. selection by position. For example, let's select the Borough column with some rows from the New York City taxi zone data using .loc[].

#displays the column "Borough", but only with the rows 3,6,9,15
zone.loc[[3,6,9,15],['Borough']]
Borough
LocationID
3 Bronx
6 Staten Island
9 Queens
15 Queens
# display all the data for rows at positions 3 through 9 (a plain slice uses positions, not labels)
zone[3:10]
Borough Zone
LocationID
4 Manhattan Alphabet City
5 Staten Island Arden Heights
6 Staten Island Arrochar/Fort Wadsworth
7 Queens Astoria
8 Queens Astoria Park
9 Queens Auburndale
10 Queens Baisley Park
# finally just display the Borough column
zone['Borough']
LocationID
1                EWR
2             Queens
3              Bronx
4          Manhattan
5      Staten Island
           ...      
261        Manhattan
262        Manhattan
263        Manhattan
264          Unknown
265          Unknown
Name: Borough, Length: 265, dtype: object
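To make the label-versus-position distinction between .loc[] and .iloc[] concrete, here is a minimal sketch on a tiny stand-in for the taxi zone table (hypothetical rows, not loaded from the real file):

```python
import pandas as pd

# A tiny stand-in for the taxi zone lookup table
zone_demo = pd.DataFrame(
    {'Borough': ['EWR', 'Queens', 'Bronx', 'Manhattan'],
     'Zone': ['Newark Airport', 'Jamaica Bay',
              'Allerton/Pelham Gardens', 'Alphabet City']},
    index=pd.Index([1, 2, 3, 4], name='LocationID'))

# .loc selects by index *label*: label 1 is the first row here
by_label = zone_demo.loc[1, 'Borough']      # 'EWR'

# .iloc selects by integer *position*: position 1 is the second row
by_position = zone_demo.iloc[1, 0]          # 'Queens'
```

Because this index starts at 1, labels and positions are off by one, which is exactly the kind of confusion that choosing the right accessor avoids.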
  4. .rename(): Sometimes column names are written in a way that is not easy to understand, and we need to rename them. The .rename() method takes a couple of parameters, one of which is a dictionary-like object mapping old names to new names.

# Rename the Zone column to Taxi Zone
zone.rename(columns={'Zone':'Taxi Zone'}, inplace=True)
zone.head()
Borough Taxi Zone
LocationID
1 EWR Newark Airport
2 Queens Jamaica Bay
3 Bronx Allerton/Pelham Gardens
4 Manhattan Alphabet City
5 Staten Island Arden Heights
  5. .columns: While not technically a pandas function, this property is helpful when working with a large dataset that has many columns and you want to see all of them. .columns displays every column in the dataset.

books_df = load_data('CharlesBookClub')
books_df.head()
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1

5 rows × 24 columns

Notice how there is a ... between CookBks and ItalCook? This shows that not all 24 columns can be displayed in the notebook at once, which could lead us to believe there are only 20 columns. Using the .columns property gives us a pandas Index object holding every column label.

#this prints all the columns in the book club dataset
print(books_df.columns)
Index(['Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
       'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
       'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase',
       'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence'],
      dtype='object')
  6. len(): (Not strictly a pandas function.) Provides the length (number of rows) of the dataframe. For example:

len(books_df)
4000
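Closely related, the .shape attribute reports rows and columns at once; a minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows = len(df)    # 3 rows
dims = df.shape     # (3, 2): (rows, columns)
```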
  7. .dtypes: It is useful to know the data types of our variables before analysis. This property shows what kind of data is expected in each column.

# Print the datatypes of the columns in the books dataframe
print(books_df.dtypes)
Seq#                int64
ID#                 int64
Gender              int64
M                   int64
R                   int64
F                   int64
FirstPurch          int64
ChildBks            int64
YouthBks            int64
CookBks             int64
DoItYBks            int64
RefBks              int64
ArtBks              int64
GeogBks             int64
ItalCook            int64
ItalAtlas           int64
ItalArt             int64
Florence            int64
Related Purchase    int64
Mcode               int64
Rcode               int64
Fcode               int64
Yes_Florence        int64
No_Florence         int64
dtype: object
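When .dtypes reveals a column with the wrong type (for example, numbers read in as text), the .astype() method converts it; a minimal sketch with hypothetical values:

```python
import pandas as pd

# 'M' holds numbers but is stored as text
df = pd.DataFrame({'Gender': [1, 0, 1], 'M': ['297', '128', '138']})

# dtypes shows 'M' came in as object (text), not a number
print(df.dtypes)

# .astype() converts the column so numeric operations work
df['M'] = df['M'].astype(int)
```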

Again, this list of pandas functions useful for data manipulation is not exhaustive. You can find many more in the pandas documentation.