{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Functions\n", "\n", "So far, we have looked at the basics of using the programming language, python. As we proceed to learning how python is used in data analysis, let's study some common functions used in pandas library for data manipulation (or data cleaning). It is important to note that this list of methods are not exhastive. Let's look at some commonly used pandas functions.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# This code just loads up the data we need for the examples below\n", "import sys\n", "sys.path.append('..')\n", "from src.data import load_data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. [.drop()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop): In a dataset there can be columns with missing values or columns that are not just useful for your analysis. For example, you might have a New York city taxi dataset containing information (trip id, datetime, trip distance, payment types, passenger count) but want to focus on analyzing trip trends in a day. In this case, the payment types and passenger count are not relevant for our analysis. Having them in our dataset will take up unnecessary space and processing time.\n", "\n", "When working with pandas we can solve this by using the drop() method. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/workspaces/py4a/book/data/taxi_zone_lookup.csv\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BoroughZone
LocationID
1EWRNewark Airport
2QueensJamaica Bay
3BronxAllerton/Pelham Gardens
4ManhattanAlphabet City
5Staten IslandArden Heights
\n", "
" ], "text/plain": [ " Borough Zone\n", "LocationID \n", "1 EWR Newark Airport\n", "2 Queens Jamaica Bay\n", "3 Bronx Allerton/Pelham Gardens\n", "4 Manhattan Alphabet City\n", "5 Staten Island Arden Heights" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#index, specifies that we can use the LocationID column in the data as the dataframe index\n", "import pandas as pd\n", "zone=load_data('taxi_zone_lookup',index_col='LocationID')\n", "\n", "#Here, we are dropping all the columns that are not needed for our analysis\n", "#axis=1 identifies the column\n", "#inplace=True does not return a new table but retains the existing table\n", "zone.drop(['service_zone'], axis=1, inplace=True)\n", "zone.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "2. [`.fillna()`][fillna]: Sometimes, instead of dropping missing values, we might want to replace cells with N/A with some value. To achieve this we use the [.fillna()][def] method. [replace()][replace] and [interpolate()][interpolate] are also used in replacing missing values.\n", "\n", "[fillna]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html?highlight=fillna#pandas.DataFrame.fillna\n", "[replace]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html?highlight=replace#pandas.DataFrame.replace\n", "[interpolate]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html?highlight=interpolate#pandas.DataFrame.interpolate" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. [.loc[]][loc_def]: This is an indexing method that is used to simply select particular rows or columns. It selects columns or rows using label names. [.loc[]][loc_def] parameter accepts strings, integers, or list. Another indexing method is [.iloc[]][iloc_def] which selects columns or rows using purely integer-location based indexing for selection by positions. For example, let's select the Borough columns with some rows from the New York City taxi zone data using loc().\n", "\n", "[loc_def]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html?highlight=iloc#pandas.DataFrame.iloc\n", "[iloc_def]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Borough
LocationID
3Bronx
6Staten Island
9Queens
15Queens
\n", "
" ], "text/plain": [ " Borough\n", "LocationID \n", "3 Bronx\n", "6 Staten Island\n", "9 Queens\n", "15 Queens" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#displays the column \"Borough\", but only with the rows 3,6,9,15\n", "zone.loc[[3,6,9,15],['Borough']]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BoroughTaxi Zone
LocationID
4ManhattanAlphabet City
5Staten IslandArden Heights
6Staten IslandArrochar/Fort Wadsworth
7QueensAstoria
8QueensAstoria Park
9QueensAuburndale
10QueensBaisley Park
\n", "
" ], "text/plain": [ " Borough Taxi Zone\n", "LocationID \n", "4 Manhattan Alphabet City\n", "5 Staten Island Arden Heights\n", "6 Staten Island Arrochar/Fort Wadsworth\n", "7 Queens Astoria\n", "8 Queens Astoria Park\n", "9 Queens Auburndale\n", "10 Queens Baisley Park" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# display all the data for the rows 3-10\n", "zone[3:10]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LocationID\n", "1 EWR\n", "2 Queens\n", "3 Bronx\n", "4 Manhattan\n", "5 Staten Island\n", " ... \n", "261 Manhattan\n", "262 Manhattan\n", "263 Manhattan\n", "264 Unknown\n", "265 Unknown\n", "Name: Borough, Length: 265, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# finally just display the Borough column\n", "zone['Borough']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "4. [.rename()][rename_def]: Sometimes, column names are written in a way that it's not easy to understand. In such case, we need to rename such column names. The method [.rename()][rename_def] takes in couple parameters one of wihich is a dictionary-like object. This dictionary-like object contains a mapping of the old names and new names.\n", "\n", "[rename_def]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BoroughTaxi Zone
LocationID
1EWRNewark Airport
2QueensJamaica Bay
3BronxAllerton/Pelham Gardens
4ManhattanAlphabet City
5Staten IslandArden Heights
\n", "
" ], "text/plain": [ " Borough Taxi Zone\n", "LocationID \n", "1 EWR Newark Airport\n", "2 Queens Jamaica Bay\n", "3 Bronx Allerton/Pelham Gardens\n", "4 Manhattan Alphabet City\n", "5 Staten Island Arden Heights" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Rename the Zone column with Taxi Zone\n", "zone.rename(columns={'Zone':'Taxi Zone'}, inplace=True)\n", "zone.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "5. `.columns`: While not technically a pandas function, this property is helpful when working with a large dataset with so many columns, you want to see all the columns in that dataset. `.columns` displays all the columns in the dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/workspaces/py4a/book/data/CharlesBookClub.csv\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Seq#ID#GenderMRFFirstPurchChildBksYouthBksCookBks...ItalCookItalAtlasItalArtFlorenceRelated PurchaseMcodeRcodeFcodeYes_FlorenceNo_Florence
0125129714222011...0000054201
122901288210000...0000043201
2346113822756212...1000244301
34471228212000...0000051101
4551125710110000...0000053101
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks \\\n", "0 1 25 1 297 14 2 22 0 1 1 \n", "1 2 29 0 128 8 2 10 0 0 0 \n", "2 3 46 1 138 22 7 56 2 1 2 \n", "3 4 47 1 228 2 1 2 0 0 0 \n", "4 5 51 1 257 10 1 10 0 0 0 \n", "\n", " ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode \\\n", "0 ... 0 0 0 0 0 5 \n", "1 ... 0 0 0 0 0 4 \n", "2 ... 1 0 0 0 2 4 \n", "3 ... 0 0 0 0 0 5 \n", "4 ... 0 0 0 0 0 5 \n", "\n", " Rcode Fcode Yes_Florence No_Florence \n", "0 4 2 0 1 \n", "1 3 2 0 1 \n", "2 4 3 0 1 \n", "3 1 1 0 1 \n", "4 3 1 0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "books_df = load_data('CharlesBookClub')\n", "books_df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice here how their is a `...` between `Cookbks` and `ItalCook`? This shows that not all 24 columns can be displayed in the notebook at once. This can lead us to believe that there are only 20 columns. Using the `.columns` property gives us a [pandas series][series], (which we can think of as a one column of a dataframe).\n", "\n", "[series]: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html?highlight=series#pandas.Series" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Seq#', 'ID#', 'Gender', 'M', 'R', 'F', 'FirstPurch', 'ChildBks',\n", " 'YouthBks', 'CookBks', 'DoItYBks', 'RefBks', 'ArtBks', 'GeogBks',\n", " 'ItalCook', 'ItalAtlas', 'ItalArt', 'Florence', 'Related Purchase',\n", " 'Mcode', 'Rcode', 'Fcode', 'Yes_Florence', 'No_Florence'],\n", " dtype='object')\n" ] } ], "source": [ "#this prints all the columns in the bankruptcy dataset\n", "print(books_df.columns)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "6. `len()`: (Not strictly a pandas fuction). Provides the length (number of rows) of the dataframe. For example" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4000" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(books_df)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "7. `.dtypes`: Because it is useful to know the data type of our variables before analysis. This function will show what kind of data is expected in each column." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Seq# int64\n", "ID# int64\n", "Gender int64\n", "M int64\n", "R int64\n", "F int64\n", "FirstPurch int64\n", "ChildBks int64\n", "YouthBks int64\n", "CookBks int64\n", "DoItYBks int64\n", "RefBks int64\n", "ArtBks int64\n", "GeogBks int64\n", "ItalCook int64\n", "ItalAtlas int64\n", "ItalArt int64\n", "Florence int64\n", "Related Purchase int64\n", "Mcode int64\n", "Rcode int64\n", "Fcode int64\n", "Yes_Florence int64\n", "No_Florence int64\n", "dtype: object\n" ] } ], "source": [ "# Print the datatypes of the columns in the books dataframe\n", "print(books_df.dtypes)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Again the list of pandas functions that are useful in data manipulation is non-exhastive. You can find more pandas functions [here](https://pandas.pydata.org/docs/reference/frame.html)." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" } } }, "nbformat": 4, "nbformat_minor": 2 }