Pandas Functions
Pandas Functions#
So far, we have looked at the basics of using the programming language, python. As we proceed to learning how python is used in data analysis, let’s study some common functions used in pandas library for data manipulation (or data cleaning). It is important to note that this list of methods are not exhastive. Let’s look at some commonly used pandas functions.
# This code just loads up the data we need for the examples below
import sys
sys.path.append('..')
from src.data import load_data
.drop(): In a dataset there can be columns with missing values or columns that are not just useful for your analysis. For example, you might have a New York city taxi dataset containing information (trip id, datetime, trip distance, payment types, passenger count) but want to focus on analyzing trip trends in a day. In this case, the payment types and passenger count are not relevant for our analysis. Having them in our dataset will take up unnecessary space and processing time.
When working with pandas we can solve this by using the drop() method.
#index, specifies that we can use the LocationID column in the data as the dataframe index
import pandas as pd
zone=load_data('taxi_zone_lookup',index_col='LocationID')
#Here, we are dropping all the columns that are not needed for our analysis
#axis=1 identifies the column
#inplace=True does not return a new table but retains the existing table
zone.drop(['service_zone'], axis=1, inplace=True)
zone.head()
/workspaces/py4a/book/data/taxi_zone_lookup.csv
Borough | Zone | |
---|---|---|
LocationID | ||
1 | EWR | Newark Airport |
2 | Queens | Jamaica Bay |
3 | Bronx | Allerton/Pelham Gardens |
4 | Manhattan | Alphabet City |
5 | Staten Island | Arden Heights |
.fillna()
: Sometimes, instead of dropping missing values, we might want to replace cells with N/A with some value. To achieve this we use the [.fillna()][def] method. replace() and interpolate() are also used in replacing missing values.
.loc[]: This is an indexing method that is used to simply select particular rows or columns. It selects columns or rows using label names. .loc[] parameter accepts strings, integers, or list. Another indexing method is .iloc[] which selects columns or rows using purely integer-location based indexing for selection by positions. For example, let’s select the Borough columns with some rows from the New York City taxi zone data using loc().
#displays the column "Borough", but only with the rows 3,6,9,15
zone.loc[[3,6,9,15],['Borough']]
Borough | |
---|---|
LocationID | |
3 | Bronx |
6 | Staten Island |
9 | Queens |
15 | Queens |
# display all the data for the rows 3-10
zone[3:10]
Borough | Taxi Zone | |
---|---|---|
LocationID | ||
4 | Manhattan | Alphabet City |
5 | Staten Island | Arden Heights |
6 | Staten Island | Arrochar/Fort Wadsworth |
7 | Queens | Astoria |
8 | Queens | Astoria Park |
9 | Queens | Auburndale |
10 | Queens | Baisley Park |
# finally just display the Borough column
zone['Borough']
LocationID
1 EWR
2 Queens
3 Bronx
4 Manhattan
5 Staten Island
...
261 Manhattan
262 Manhattan
263 Manhattan
264 Unknown
265 Unknown
Name: Borough, Length: 265, dtype: object
.rename(): Sometimes, column names are written in a way that it’s not easy to understand. In such case, we need to rename such column names. The method .rename() takes in couple parameters one of wihich is a dictionary-like object. This dictionary-like object contains a mapping of the old names and new names.
# Rename the Zone column with Taxi Zone
zone.rename(columns={'Zone':'Taxi Zone'}, inplace=True)
zone.head()
Borough | Taxi Zone | |
---|---|---|
LocationID | ||
1 | EWR | Newark Airport |
2 | Queens | Jamaica Bay |
3 | Bronx | Allerton/Pelham Gardens |
4 | Manhattan | Alphabet City |
5 | Staten Island | Arden Heights |
.columns
: While not technically a pandas function, this property is helpful when working with a large dataset with so many columns, you want to see all the columns in that dataset..columns
displays all the columns in the dataset.
books_df = load_data('CharlesBookClub')
books_df.head()
/workspaces/py4a/book/data/CharlesBookClub.csv
Seq# | ID# | Gender | M | R | F | FirstPurch | ChildBks | YouthBks | CookBks | ... | ItalCook | ItalAtlas | ItalArt | Florence | Related Purchase | Mcode | Rcode | Fcode | Yes_Florence | No_Florence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 25 | 1 | 297 | 14 | 2 | 22 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 4 | 2 | 0 | 1 |
1 | 2 | 29 | 0 | 128 | 8 | 2 | 10 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 4 | 3 | 2 | 0 | 1 |
2 | 3 | 46 | 1 | 138 | 22 | 7 | 56 | 2 | 1 | 2 | ... | 1 | 0 | 0 | 0 | 2 | 4 | 4 | 3 | 0 | 1 |
3 | 4 | 47 | 1 | 228 | 2 | 1 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 1 | 0 | 1 |
4 | 5 | 51 | 1 | 257 | 10 | 1 | 10 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 5 | 3 | 1 | 0 | 1 |
5 rows × 24 columns
Notice here how their is a ...
between Cookbks
and ItalCook
? This shows that not all 24 columns can be displayed in the notebook at once. This can lead us to believe that there are only 20 columns. Using the .columns
property gives us a pandas series, (which we can think of as a one column of a dataframe).
#this prints all the columns in the bankruptcy dataset
print(books_df.columns)
Index(['Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',
'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',
'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase',
'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence'],
dtype='object')
len()
: (Not strictly a pandas fuction). Provides the length (number of rows) of the dataframe. For example
len(books_df)
4000
.dtypes
: Because it is useful to know the data type of our variables before analysis. This function will show what kind of data is expected in each column.
# Print the datatypes of the columns in the books dataframe
print(books_df.dtypes)
Seq# int64
ID# int64
Gender int64
M int64
R int64
F int64
FirstPurch int64
ChildBks int64
YouthBks int64
CookBks int64
DoItYBks int64
RefBks int64
ArtBks int64
GeogBks int64
ItalCook int64
ItalAtlas int64
ItalArt int64
Florence int64
Related Purchase int64
Mcode int64
Rcode int64
Fcode int64
Yes_Florence int64
No_Florence int64
dtype: object
Again the list of pandas functions that are useful in data manipulation is non-exhastive. You can find more pandas functions here.