Meet the Pandas

thomas-bonometti-OyO5NDiRPMM-unsplash.jpg Photo by Thomas Bonometti on Unsplash

We started our space trip to the galaxy of Python Analytics heading onto Jupyter .

Now it’s time to meet some of the most fascinating inhabitants: the pandas

The code, datasets and jupyter notebook for the posts in this series are available in this repository

Using Pandas (basic introduction)

Pandas is a library to work with data using relational tables

To prepare for this lesson execute the following cell

!git clone https://github.com/datasciencedojo/datasets.git
Cloning into 'datasets'...

import the pandas library and assign it a shorter alias

import pandas as pd

Loading data

Pandas includes a rich set of input functions that allow you to get data from various file types

function format notes
pd.read_csv textual csv  
pd.read_excel binary excel format requires external library
pd.read_parquet fast binary columnar format requires pyarrow

A data frame contains many functions to explore it e.g. the .head() method shows the first lines of a data frame

titanic = pd.read_csv("datasets/titanic.csv")
titanic.head()
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 nan S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.925 nan S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.05 nan S

Projection, Selection and Extension

a data frame is a table; you can get its column names using the .columns attributes

titanic.columns

Index([’PassengerId’, ’Survived’, ’Pclass’, ’Name’, ’Sex’, ’Age’, ’SibSp’, ’Parch’, ’Ticket’, ’Fare’, ’Cabin’, ’Embarked’], dtype=’object’)

columns can be accessed individually or in groups; this operation is called projection

Single columns can be accessed either

  1. using a square bracket operator df.["age"]
  2. using the dot operator if the column name is a good identifier df.age

Each column is called a Series in pandas jargon

Groups of columns can be accessed by passing a list of strings to the bracket operator

titanic[["Survived","Pclass","Sex","Age"]].head()
  Survived Pclass Sex Age
0 0 3 male 22.0
1 1 1 female 38.0
2 1 3 female 26.0
3 1 1 female 35.0
4 0 3 male 35.0

operations on series are vectorized i.e. each individual element is used to get a new vector

Operations within a series and a scalar value are repeated for all values of a series

titanic.Pclass == 1

returns a series of booleans

By passing a list of booleans to the square bracket operators this filters all of the lines which are satisfying the logic statement expressed; this operation is called selection which is a synonim for filter

titanic[titanic.Pclass == 1].head()
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.55 C103 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5 A6 S

Usually selection and projection are used at the same time; the .loc[,] operator can be conveniently used for this purpose; its arguments are:

  1. a boolean list for rows or the splice operator : for no filter
  2. a string list of column names or the splice operator : for all columns
titanic.loc[titanic.Pclass==1,["Survived","Sex","Age"]].head()
  Survived Sex Age
1 1 female 38.0
3 1 female 35.0
6 0 male 54.0
11 1 female 58.0
23 1 male 28.0

It is possible to extend a table with more columns possibly as a result from a computation in other columns

To create a new column, just assign an expression to a new column name e.g.

df["above_average"] = (df.score > df.score.mean())
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns

Index([’code’, ’name’, ’continent’, ’region’, ’surface_area’, ’independence_year’, ’population’, ’life_expectancy’, ’gnp’, ’gnp_old’, ’local_name’, ’government_form’, ’head_of_state’, ’capital’, ’code2’], dtype=’object’)

Exercise

calculate the population density of each country

The countries table contains the population size in the population column and the land extension in the surface area

  1. calculate the ratio of these two columns and store it in a new column called population density
countries["population_density"] = countries.population /  countries.surface_area
countries.loc[:,["name","population_density"]].head()
  name population_density
0 Aruba 533.6787564766839
1 Afghanistan 34.84181631369903
2 Angola 10.32967032967033
3 Anguilla 83.33333333333333
4 Albania 118.31083901488799

  1. sort the table in descending order using the .sort_values function
  2. restrict the columns to only the ["name","population_density"] columns
  3. show the first lines of the table using the .head() method: what are the most densely populated countries?
countries.sort_values("population_density",ascending=False).loc[:,["name","population_density"]].head()
  name population_density
129 Macao 26277.777777777777
131 Monaco 22666.666666666668
92 Hong Kong 6308.837209302325
186 Singapore 5771.844660194175
79 Gibraltar 4166.666666666667

Join and concatenation

A relation may be composed by more than a table; this may offer some consistency and operation efficiency.

If two tables represents entitites which are related they can be joined by selecting one or more columns which contains those attributes which creates the relationship.

Per each matched rows in a table, this is replicated as many times as the matched rows in the other table

There are four kinds of available joins

join data incuded added missing values
inner only rows which belongs to both tables none
left all rows of the first table for all non matching rows in the first table
right all rows of the second table for all non matching rows in the second table
outer all rows of both tables for all non matching rows

Suppose we have a list of courses, classrooms and classroom booking per each course; if we want to know where each professor should hold his lesson we need to join these tables

course_id title professor
1 quantum field theory Bohr
2 thermodynamics Carnot
3 statistics Gosset
classroom_id building floor
p124 Purple 1
r201 Red 2
course_id classroom_id weekday start end
1 p124 Monday 9 11
1 r201 Wednesday 14 15
2 r201 Tuesday 14 17
3 r201 Monday 14 15
3 p124 Tuesday 9 10
3 p124 Wednesday 9 10

the pd.merge() function performs the join operation e.g.

courses_classrooms = pd.merge(courses,classroom)
courses_bookings = ps.merge(courses_classroom, bookings)

The default kind of join is inner you can use the how= optional argument to choose another kind.

pd.merge will join by default all columns with identical name: if you want to restrict the join to a given list of column you can use the on= option.

If you have different names for the join columns you can use left_on= and right_on= options to match them.

Exercise

  • in the country table we have a list of countries including their population
  • in the languages table we have a list of languages spoken in each country and the percentage of the population which speaks said language
  • in the country table we have a textual code which is uniquely assigned to each county
  • in the languages table we have the same code in a column called country_code
  • load the language table from datasets/WorldDBTables/LanguageTable.csv using the pd.read_csv function and store it in a variable called languages
  • create a table named language_by_country using the pd.merge function and joining the column code of table countries with the column country_code from the languages table
  • calculate the number of people speaking a language by multiplying the population column with the percentage column (don’t forget to divide by 100!); put the result in a column called poeple_speaking
  • show some lines of the table keeping only the following columns: ["name","language","people_speaking","official"] what do you see?
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")
languages_by_country = pd.merge(
    countries, languages, 
    how="inner", 
    left_on=["code"], right_on=["country_code"]
)
languages_by_country["people_speaking"] = languages_by_country.population * \
    languages_by_country.percentage / 100
languages_by_country[["name","language","people_speaking","official"]].head()
  name language people_speaking official
0 Aruba Dutch 5459.0 T
1 Aruba English 9785.0 F
2 Aruba Papiamento 79001.0 F
3 Aruba Spanish 7622.0 F
4 Afghanistan Balochi 204480.0 F

Concatenation

It may happen that your data is collected in separated dataframes with the same columns ans you need to create a single one from all of them.

unzip ROMA.zip TG_SOUID100860.txt
unzip BARI.zip TG_SOUID245914.txt
roma = pd.read_csv("TG_SOUID100860.txt",skiprows=20)
roma.head()
  SOUID DATE TG Q_TG
0 100860 19510101 76 0
1 100860 19510102 108 0
2 100860 19510103 116 0
3 100860 19510104 115 0
4 100860 19510105 82 0
bari = pd.read_csv("TG_SOUID245914.txt",skiprows=20)
bari.head()
  SOUID DATE TG Q_TG
0 245914 20211201 -9999 9
1 245914 20211202 -9999 9
2 245914 20211203 -9999 9
3 245914 20211204 -9999 9
4 245914 20211205 -9999 9

the pd.concat() function can concatenate a list of data frames; the default behavior is consistent with the semantic of relations and it retunrns a single data frame:

  • columns will be the union of all columns of each individual data frame in the input
  • rows will keep the same order as the data frames
temperatures = pd.concat([roma,bari])

# this will remove extra spaces from column names
temperatures.columns = list(map(str.strip,temperatures.columns))

# this will transform the column type
for col in ["SOUID","Q_TG"]:
    temperatures[col] = temperatures[col].astype("category")
temperatures["DATE"]=pd.to_datetime(temperatures["DATE"],format="%Y%m%d")
print(temperatures.Q_TG.unique())
temperatures.loc[temperatures.Q_TG != 9,:].describe(include="all")
  SOUID DATE TG Q_TG  
count 21717.0 np.int64 (21717) 21717.0 21717.0
unique 2.0 nan nan 2.0  
top 100860.0 nan nan 0.0  
freq 21525.0 nan nan 21711.0  
mean nan Timestamp (1980-11-11 03:17:47.716535360) 154.8837316388083 nan
min nan Timestamp (1951-01-01 00:00:00) -56.0 nan
25% nan Timestamp (1965-11-17 00:00:00) 101.0 nan
50% nan Timestamp (1980-09-28 00:00:00) 150.0 nan
75% nan Timestamp (1995-08-22 00:00:00) 212.0 nan
max nan Timestamp (2022-10-18 00:00:00) 327.0 nan
std nan nan 66.53937042433274 nan  
[0, 9, 1]
Categories (3, int64): [0, 1, 9]

Aggregation

very often you may want to group your data according to one or more attribute and perform some calculation on each group, this operation is called aggregation

e.g. suppose I want to split a restaurant bill with my friends and I have a dataframe which looks like the following table

person item amount
me pepperoni pizza 12
me lager pils 5
andrea cheeseburger 10
andrea coca cola 2
andrea french fries 2
groups = bill.groupby(["person"])
groups.agg({"amount":"sum"})

will return

person amount
me 17
andrea 14

It is also possible to ask for multiple aggregation by using a list of functions

g = titanic.groupby(["Pclass","Sex"])
age_summary = g.agg({"Age":["min","max","mean"]})
age_summary
  (Age min) (Age max) (Age mean)
(1 female) 2.0 63.0 34.61176470588235
(1 male) 0.92 80.0 41.28138613861386
(2 female) 2.0 57.0 28.722972972972972
(2 male) 0.67 70.0 30.74070707070707
(3 female) 0.75 63.0 21.75
(3 male) 0.42 74.0 26.507588932806325

Please note that here the generated columns are accessible using a tuple i.e.

age_summary[("Age","mean")]

Exercise

using the languages_by_country table we created in the previous exercise

  1. create a grouping by using the "language" column
  2. using the .agg() method calculate how many people speak each language
  3. sort the dataset from the largest group descending
  4. show the first lines using .head() method
g = languages_by_country.groupby(["language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
language people_speaking
Chinese 1190152805.0
Hindi 405619174.0
Spanish 307997398.0
Bengali 209304719.0
Arabic 205490840.7
Portuguese 176981914.4
Japanese 126254034.0
Punjabi 104025371.0
English 91616031.3
Javanese 83570158.0
Telugu 79065636.0
Marathi 75010988.0
Korean 71450757.0
Vietnamese 69908416.0
Tamil 68682272.0
French 67947730.0
Urdu 63589470.0
Italian 57183654.1
Gujarati 48655776.0
Malay 41517994.0
g = languages_by_country.groupby(["continent","language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
  people_speaking
(Asia Chinese) 1189353427.0
(Asia Hindi) 405169038.0
(Asia Bengali) 209304719.0
(South America Portuguese) 166037997.0
(South America Spanish) 145620868.0
(Africa Arabic) 134392131.0
(North America Spanish) 132707046.0
(Asia Japanese) 125573574.0
(Asia Punjabi) 103807342.0
(Asia Javanese) 83570158.0
(Asia Telugu) 79065636.0
(Asia Marathi) 75010988.0
(Asia Korean) 71445687.0
(Asia Vietnamese) 69908416.0
(Asia Arabic) 69184280.0
(Asia Tamil) 68682272.0
(Asia Urdu) 63589470.0
(Europe English) 61799068.300000004
(Europe French) 60455448.0
(Europe Italian) 55344151.1
languages_spoken_sorted=languages_spoken_sorted.reset_index()
g = languages_spoken_sorted.groupby(["continent"])
result = []
for i,subtable in g:
    result.append(subtable.head(3).reset_index())
pd.concat(result).head()
  index continent language people_speaking
0 5 Africa Arabic 134392131.0
1 32 Africa Hausa 29225396.0
2 33 Africa Joruba 24868874.0
0 0 Asia Chinese 1189353427.0
1 1 Asia Hindi 405169038.0

Translate the content of a table

Suppose we need to translate some foreign language content

resources = pd.read_csv("ds523_consumoacquaenergia.csv",sep=";")
resources.head()
  anno Consumo pro capite tipo Consumo pro capite
0 2011 Energia elettrica per uso domestico 1196.1
1 2011 Gas metano per uso domestico e riscaldamento 377.9
2 2011 Acqua fatturata per uso domestico 83.1
3 2010 Energia elettrica per uso domestico 1200.7
4 2010 Gas metano per uso domestico e riscaldamento 406.2

The second column looks like a categorical series, so let’s check it

resources["Consumo pro capite tipo"].unique()

[’Energia elettrica per uso domestico’ ’Gas metano per uso domestico e riscaldamento’ ’Acqua fatturata per uso domestico’]

we can pass a dictionary to the .map() method like this:

translate = {
    'Energia elettrica per uso domestico':'electricity',
    'Gas metano per uso domestico e riscaldamento':'methan',
    'Acqua fatturata per uso domestico':'water'
}
resources["type"] = resources["Consumo pro capite tipo"].map(translate)

Also columns can be renamed or removed

resources = resources.rename({"anno":"year","Consumo pro capite":"usage per person"}, axis="columns")
del resources["Consumo pro capite tipo"]
resources.head()
  year usage per person type
0 2011 1196.1 electricity
1 2011 377.9 methan
2 2011 83.1 water
3 2010 1200.7 electricity
4 2010 406.2 methan

Pivoting and melting

Pivot is a family of aggregation functions whose main purpose is to collect data from a relation and aggregate them by using one or more attribute columns.

This process will create a column per each combination of the attributes; the result table is sometime referred as “wide format” table or “two entries table”; let’s make an example

resources2 = resources.pivot(index="year",columns="type",values="usage per person").reset_index()
resources2
  year electricity methan water
0 2000.0 1130.2 509.0 92.1
1 2001.0 1143.9 500.7 91.3
2 2002.0 1195.5 504.2 90.4
3 2003.0 1222.8 480.2 87.3
4 2004.0 1228.6 442.4 80.4
5 2005.0 1225.0 434.5 81.3
6 2006.0 1219.7 431.3 82.2
7 2007.0 1197.0 381.1 81.6
8 2008.0 1203.0 384.9 84.5
9 2009.0 1202.9 389.6 85.8
10 2010.0 1200.7 406.2 83.2
11 2011.0 1196.1 377.9 83.1

As there was exactly one value per each year and each commodity the previous example just moved values without performing any calculation.

Suppose now we want to split some restaurant bill

bill = pd.DataFrame([
    {"item":"pepperoni pizza", "amount":12, "customer": "Marco", "day": "Monday"},
    {"item":"beer", "amount":7.5, "customer": "Marco", "day": "Monday"},
    {"item":"coffee", "amount":1.2, "customer": "Marco", "day": "Monday"},
    {"item":"pizza margherita", "amount":10, "customer": "Luca", "day": "Monday"},
    {"item":"wine", "amount":10, "customer": "Luca", "day": "Monday"},
    {"item":"steak", "amount":20, "customer": "Marco", "day": "Tuesday"},
    {"item":"bottled water", "amount":5, "customer": "Marco", "day": "Tuesday"},
])
bill
  item amount customer day
0 pepperoni pizza 12.0 Marco Monday
1 beer 7.5 Marco Monday
2 coffee 1.2 Marco Monday
3 pizza margherita 10.0 Luca Monday
4 wine 10.0 Luca Monday
5 steak 20.0 Marco Tuesday
6 bottled water 5.0 Marco Tuesday

pandas function pivot_table allows to define an aggregation function in case of collision

splitted_bill = pd.pivot_table(bill,index="day",values="amount",columns="customer",aggfunc="sum")
splitted_bill
day Luca Marco
Monday 20.0 20.7
Tuesday nan 25.0

pandas pd.melt() function provides a way to get a “long format” table

pd.melt(splitted_bill)
  customer value
0 Luca 20.0
1 Luca nan
2 Marco 20.7
3 Marco 25.0

 

marco.p.v.vezzoli

Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions

You may also like...

Leave a Reply