Meet the Pandas
We started our space trip through the galaxy of Python Analytics by heading to Jupyter.
Now it's time to meet some of its most fascinating inhabitants: the pandas.
The code, datasets and Jupyter notebooks for the posts in this series are available in this repository.
Using Pandas (basic introduction)
Pandas is a library for working with data as relational tables.
To prepare for this lesson, execute the following cell:
!git clone https://github.com/datasciencedojo/datasets.git
Cloning into 'datasets'...
Import the pandas library and assign it a shorter alias:
import pandas as pd
Loading data
Pandas includes a rich set of input functions that allow you to load data from various file types:
function | format | notes
---|---|---
pd.read_csv | textual CSV |
pd.read_excel | binary Excel format | requires an external library
pd.read_parquet | fast binary columnar format | requires pyarrow
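As a quick sketch (the file names below are placeholders, not files from this series' repository), all readers are used the same way and return a data frame:

df_csv = pd.read_csv("data.csv")              # plain-text CSV
df_excel = pd.read_excel("data.xlsx")         # needs an external engine such as openpyxl
df_parquet = pd.read_parquet("data.parquet")  # needs pyarrow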
A data frame provides many methods to explore it, e.g. the .head() method shows the first rows:
titanic = pd.read_csv("datasets/titanic.csv")
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | nan | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | nan | S |
Projection, Selection and Extension
A data frame is a table; you can get its column names using the .columns attribute:
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Columns can be accessed individually or in groups; this operation is called projection.
Single columns can be accessed either:
- using the square bracket operator: df["age"]
- using the dot operator (if the column name is a valid identifier): df.age
Each column is called a Series in pandas jargon.
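For example, both of the following return the Age column of the titanic data frame as a Series:

titanic["Age"].head()  # bracket operator: works for any column name
titanic.Age.head()     # dot operator: works because "Age" is a valid identifier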
Groups of columns can be accessed by passing a list of strings to the bracket operator:
titanic[["Survived","Pclass","Sex","Age"]].head()
 | Survived | Pclass | Sex | Age
---|---|---|---|---
0 | 0 | 3 | male | 22.0 |
1 | 1 | 1 | female | 38.0 |
2 | 1 | 3 | female | 26.0 |
3 | 1 | 1 | female | 35.0 |
4 | 0 | 3 | male | 35.0 |
Operations on series are vectorized, i.e. they are applied to each individual element to produce a new series.
Operations between a series and a scalar value are broadcast to all values of the series, e.g.
titanic.Pclass == 1
returns a series of booleans.
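The same holds for arithmetic, e.g. on the titanic data frame:

(titanic.Fare * 1.1).head()             # scalar operation applied to every fare
(titanic.SibSp + titanic.Parch).head()  # element-wise sum of two series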
Passing a series of booleans to the square bracket operator filters the rows satisfying the expressed condition; this operation is called selection, which is a synonym for filtering.
titanic[titanic.Pclass == 1].head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5 | A6 | S |
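Conditions can also be combined with the element-wise boolean operators & (and), | (or) and ~ (not), wrapping each condition in parentheses; e.g. to select first-class female passengers:

titanic[(titanic.Pclass == 1) & (titanic.Sex == "female")].head()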
Usually selection and projection are used at the same time; the .loc[,] operator can be conveniently used for this purpose; its arguments are:
- a boolean list for rows, or the slice operator : for no filter
- a list of column names, or the slice operator : for all columns
titanic.loc[titanic.Pclass==1,["Survived","Sex","Age"]].head()
 | Survived | Sex | Age
---|---|---|---
1 | 1 | female | 38.0 |
3 | 1 | female | 35.0 |
6 | 0 | male | 54.0 |
11 | 1 | female | 58.0 |
23 | 1 | male | 28.0 |
A table can be extended with more columns, possibly computed from other columns.
To create a new column, just assign an expression to a new column name, e.g.
df["above_average"] = (df.score > df.score.mean())
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns
Index(['code', 'name', 'continent', 'region', 'surface_area', 'independence_year', 'population', 'life_expectancy', 'gnp', 'gnp_old', 'local_name', 'government_form', 'head_of_state', 'capital', 'code2'], dtype='object')
Exercise
Calculate the population density of each country.
The countries table contains the population size in the population column and the land extension in the surface_area column.
- calculate the ratio of these two columns and store it in a new column called population_density
countries["population_density"] = countries.population / countries.surface_area countries.loc[:,["name","population_density"]].head()
 | name | population_density
---|---|---
0 | Aruba | 533.6787564766839 |
1 | Afghanistan | 34.84181631369903 |
2 | Angola | 10.32967032967033 |
3 | Anguilla | 83.33333333333333 |
4 | Albania | 118.31083901488799 |
- sort the table in descending order using the .sort_values function
- restrict the columns to only the ["name","population_density"] columns
- show the first lines of the table using the .head() method: what are the most densely populated countries?
countries.sort_values("population_density",ascending=False).loc[:,["name","population_density"]].head()
 | name | population_density
---|---|---
129 | Macao | 26277.777777777777 |
131 | Monaco | 22666.666666666668 |
92 | Hong Kong | 6308.837209302325 |
186 | Singapore | 5771.844660194175 |
79 | Gibraltar | 4166.666666666667 |
Join and concatenation
A relation may be composed of more than one table; this can provide consistency and efficiency of operations.
If two tables represent related entities, they can be joined by selecting one or more columns containing the attributes that create the relationship.
Each matching row in one table is replicated as many times as there are matching rows in the other table.
There are four kinds of available joins:
join | data included | added missing values
---|---|---
inner | only rows which belong to both tables | none
left | all rows of the first table | for all non-matching rows in the first table
right | all rows of the second table | for all non-matching rows in the second table
outer | all rows of both tables | for all non-matching rows
Suppose we have a list of courses, a list of classrooms and the classroom bookings for each course; if we want to know where each professor should hold their lessons, we need to join these tables.
course_id | title | professor |
---|---|---|
1 | quantum field theory | Bohr |
2 | thermodynamics | Carnot |
3 | statistics | Gosset |
classroom_id | building | floor |
---|---|---|
p124 | Purple | 1 |
r201 | Red | 2 |
course_id | classroom_id | weekday | start | end |
---|---|---|---|---|
1 | p124 | Monday | 9 | 11 |
1 | r201 | Wednesday | 14 | 15 |
2 | r201 | Tuesday | 14 | 17 |
3 | r201 | Monday | 14 | 15 |
3 | p124 | Tuesday | 9 | 10 |
3 | p124 | Wednesday | 9 | 10 |
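To follow along, here is a minimal sketch building the three data frames from the tables above:

courses = pd.DataFrame([
    {"course_id": 1, "title": "quantum field theory", "professor": "Bohr"},
    {"course_id": 2, "title": "thermodynamics", "professor": "Carnot"},
    {"course_id": 3, "title": "statistics", "professor": "Gosset"},
])
classrooms = pd.DataFrame([
    {"classroom_id": "p124", "building": "Purple", "floor": 1},
    {"classroom_id": "r201", "building": "Red", "floor": 2},
])
bookings = pd.DataFrame([
    {"course_id": 1, "classroom_id": "p124", "weekday": "Monday", "start": 9, "end": 11},
    {"course_id": 1, "classroom_id": "r201", "weekday": "Wednesday", "start": 14, "end": 15},
    {"course_id": 2, "classroom_id": "r201", "weekday": "Tuesday", "start": 14, "end": 17},
    {"course_id": 3, "classroom_id": "r201", "weekday": "Monday", "start": 14, "end": 15},
    {"course_id": 3, "classroom_id": "p124", "weekday": "Tuesday", "start": 9, "end": 10},
    {"course_id": 3, "classroom_id": "p124", "weekday": "Wednesday", "start": 9, "end": 10},
])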
The pd.merge() function performs the join operation, e.g.
courses_bookings = pd.merge(courses, bookings)          # joins on the shared column course_id
full_schedule = pd.merge(courses_bookings, classrooms)  # then joins on classroom_id
The default kind of join is inner; you can use the how= optional argument to choose another kind.
pd.merge will by default join on all columns with identical names; if you want to restrict the join to a given list of columns, you can use the on= option.
If the join columns have different names in the two tables, you can use the left_on= and right_on= options to match them.
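For example, with the tables above (the how and on values here are just illustrations):

# a left join keeps every course, adding missing values for courses without bookings
pd.merge(courses, bookings, how="left", on="course_id")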
Exercise
- in the countries table we have a list of countries, including their population
- in the languages table we have a list of languages spoken in each country and the percentage of the population which speaks each language
- in the countries table we have a textual code which is uniquely assigned to each country
- in the languages table we have the same code in a column called country_code

- load the languages table from datasets/WorldDBTables/LanguageTable.csv using the pd.read_csv function and store it in a variable called languages
- create a table named languages_by_country using the pd.merge function, joining the column code of table countries with the column country_code from the languages table
- calculate the number of people speaking a language by multiplying the population column with the percentage column (don't forget to divide by 100!); put the result in a column called people_speaking
- show some lines of the table keeping only the following columns: ["name","language","people_speaking","official"]; what do you see?
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")
languages_by_country = pd.merge(
    countries,
    languages,
    how="inner",
    left_on=["code"],
    right_on=["country_code"]
)
languages_by_country["people_speaking"] = languages_by_country.population * \
    languages_by_country.percentage / 100
languages_by_country[["name","language","people_speaking","official"]].head()
 | name | language | people_speaking | official
---|---|---|---|---
0 | Aruba | Dutch | 5459.0 | T |
1 | Aruba | English | 9785.0 | F |
2 | Aruba | Papiamento | 79001.0 | F |
3 | Aruba | Spanish | 7622.0 | F |
4 | Afghanistan | Balochi | 204480.0 | F |
Concatenation
It may happen that your data is collected in separate data frames with the same columns and you need to create a single one from all of them.
!unzip ROMA.zip TG_SOUID100860.txt
!unzip BARI.zip TG_SOUID245914.txt
roma = pd.read_csv("TG_SOUID100860.txt", skiprows=20)
roma.head()
 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
0 | 100860 | 19510101 | 76 | 0 |
1 | 100860 | 19510102 | 108 | 0 |
2 | 100860 | 19510103 | 116 | 0 |
3 | 100860 | 19510104 | 115 | 0 |
4 | 100860 | 19510105 | 82 | 0 |
bari = pd.read_csv("TG_SOUID245914.txt", skiprows=20)
bari.head()
 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
0 | 245914 | 20211201 | -9999 | 9 |
1 | 245914 | 20211202 | -9999 | 9 |
2 | 245914 | 20211203 | -9999 | 9 |
3 | 245914 | 20211204 | -9999 | 9 |
4 | 245914 | 20211205 | -9999 | 9 |
The pd.concat() function can concatenate a list of data frames; the default behavior is consistent with the semantics of relations and it returns a single data frame:
- columns will be the union of all columns of each individual data frame in the input
- rows will keep the same order as the input data frames
temperatures = pd.concat([roma, bari])
# this will remove extra spaces from column names
temperatures.columns = list(map(str.strip, temperatures.columns))
# this will transform the column type
for col in ["SOUID", "Q_TG"]:
    temperatures[col] = temperatures[col].astype("category")
temperatures["DATE"] = pd.to_datetime(temperatures["DATE"], format="%Y%m%d")
print(temperatures.Q_TG.unique())
temperatures.loc[temperatures.Q_TG != 9, :].describe(include="all")
[0, 9, 1]
Categories (3, int64): [0, 1, 9]

 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
count | 21717.0 | 21717 | 21717.0 | 21717.0
unique | 2.0 | nan | nan | 2.0
top | 100860.0 | nan | nan | 0.0
freq | 21525.0 | nan | nan | 21711.0
mean | nan | 1980-11-11 03:17:47.716535360 | 154.8837316388083 | nan
min | nan | 1951-01-01 00:00:00 | -56.0 | nan
25% | nan | 1965-11-17 00:00:00 | 101.0 | nan
50% | nan | 1980-09-28 00:00:00 | 150.0 | nan
75% | nan | 1995-08-22 00:00:00 | 212.0 | nan
max | nan | 2022-10-18 00:00:00 | 327.0 | nan
std | nan | nan | 66.53937042433274 | nan
Aggregation
Very often you may want to group your data according to one or more attributes and perform some calculation on each group; this operation is called aggregation.
E.g. suppose I want to split a restaurant bill with my friends and I have a data frame which looks like the following table:
person | item | amount |
---|---|---|
me | pepperoni pizza | 12 |
me | lager pils | 5 |
andrea | cheeseburger | 10 |
andrea | coca cola | 2 |
andrea | french fries | 2 |
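To make the snippet below runnable, here is a minimal sketch building the bill data frame from the table above:

bill = pd.DataFrame([
    {"person": "me", "item": "pepperoni pizza", "amount": 12},
    {"person": "me", "item": "lager pils", "amount": 5},
    {"person": "andrea", "item": "cheeseburger", "amount": 10},
    {"person": "andrea", "item": "coca cola", "amount": 2},
    {"person": "andrea", "item": "french fries", "amount": 2},
])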
groups = bill.groupby(["person"])
groups.agg({"amount":"sum"})
will return
person | amount |
---|---|
me | 17 |
andrea | 14 |
It is also possible to ask for multiple aggregations by using a list of functions:
g = titanic.groupby(["Pclass","Sex"])
age_summary = g.agg({"Age":["min","max","mean"]})
age_summary
(Pclass, Sex) | (Age, min) | (Age, max) | (Age, mean)
---|---|---|---
(1, female) | 2.0 | 63.0 | 34.61176470588235
(1, male) | 0.92 | 80.0 | 41.28138613861386
(2, female) | 2.0 | 57.0 | 28.722972972972972
(2, male) | 0.67 | 70.0 | 30.74070707070707
(3, female) | 0.75 | 63.0 | 21.75
(3, male) | 0.42 | 74.0 | 26.507588932806325
Please note that here the generated columns are accessible using a tuple, i.e.
age_summary[("Age","mean")]
Exercise
Using the languages_by_country table we created in the previous exercise:
- create a grouping using the "language" column
- using the .agg() method, calculate how many people speak each language
- sort the dataset from the largest group, descending
- show the first lines using the .head() method
g = languages_by_country.groupby(["language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
language | people_speaking |
---|---|
Chinese | 1190152805.0 |
Hindi | 405619174.0 |
Spanish | 307997398.0 |
Bengali | 209304719.0 |
Arabic | 205490840.7 |
Portuguese | 176981914.4 |
Japanese | 126254034.0 |
Punjabi | 104025371.0 |
English | 91616031.3 |
Javanese | 83570158.0 |
Telugu | 79065636.0 |
Marathi | 75010988.0 |
Korean | 71450757.0 |
Vietnamese | 69908416.0 |
Tamil | 68682272.0 |
French | 67947730.0 |
Urdu | 63589470.0 |
Italian | 57183654.1 |
Gujarati | 48655776.0 |
Malay | 41517994.0 |
We can also group by more than one column, e.g. by continent and language:
g = languages_by_country.groupby(["continent","language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
(continent, language) | people_speaking
---|---
(Asia, Chinese) | 1189353427.0
(Asia, Hindi) | 405169038.0
(Asia, Bengali) | 209304719.0
(South America, Portuguese) | 166037997.0
(South America, Spanish) | 145620868.0
(Africa, Arabic) | 134392131.0
(North America, Spanish) | 132707046.0
(Asia, Japanese) | 125573574.0
(Asia, Punjabi) | 103807342.0
(Asia, Javanese) | 83570158.0
(Asia, Telugu) | 79065636.0
(Asia, Marathi) | 75010988.0
(Asia, Korean) | 71445687.0
(Asia, Vietnamese) | 69908416.0
(Asia, Arabic) | 69184280.0
(Asia, Tamil) | 68682272.0
(Asia, Urdu) | 63589470.0
(Europe, English) | 61799068.300000004
(Europe, French) | 60455448.0
(Europe, Italian) | 55344151.1
To show the three most spoken languages per continent, we can iterate over the groups:
languages_spoken_sorted = languages_spoken_sorted.reset_index()
g = languages_spoken_sorted.groupby(["continent"])
result = []
for i, subtable in g:
    result.append(subtable.head(3).reset_index())
pd.concat(result).head()
 | index | continent | language | people_speaking
---|---|---|---|---
0 | 5 | Africa | Arabic | 134392131.0 |
1 | 32 | Africa | Hausa | 29225396.0 |
2 | 33 | Africa | Joruba | 24868874.0 |
0 | 0 | Asia | Chinese | 1189353427.0 |
1 | 1 | Asia | Hindi | 405169038.0 |
Translate the content of a table
Suppose we need to translate some foreign-language content:
resources = pd.read_csv("ds523_consumoacquaenergia.csv", sep=";")
resources.head()
 | anno | Consumo pro capite tipo | Consumo pro capite
---|---|---|---
0 | 2011 | Energia elettrica per uso domestico | 1196.1 |
1 | 2011 | Gas metano per uso domestico e riscaldamento | 377.9 |
2 | 2011 | Acqua fatturata per uso domestico | 83.1 |
3 | 2010 | Energia elettrica per uso domestico | 1200.7 |
4 | 2010 | Gas metano per uso domestico e riscaldamento | 406.2 |
The second column looks like a categorical series, so let’s check it
resources["Consumo pro capite tipo"].unique()
['Energia elettrica per uso domestico' 'Gas metano per uso domestico e riscaldamento' 'Acqua fatturata per uso domestico']
We can pass a dictionary to the .map() method like this:
translate = {
    'Energia elettrica per uso domestico': 'electricity',
    'Gas metano per uso domestico e riscaldamento': 'methane',
    'Acqua fatturata per uso domestico': 'water'
}
resources["type"] = resources["Consumo pro capite tipo"].map(translate)
Columns can also be renamed or removed:
resources = resources.rename(
    {"anno": "year", "Consumo pro capite": "usage per person"},
    axis="columns"
)
del resources["Consumo pro capite tipo"]
resources.head()
 | year | usage per person | type
---|---|---|---
0 | 2011 | 1196.1 | electricity
1 | 2011 | 377.9 | methane
2 | 2011 | 83.1 | water
3 | 2010 | 1200.7 | electricity
4 | 2010 | 406.2 | methane
Pivoting and melting
Pivot is a family of aggregation functions whose main purpose is to collect data from a relation and aggregate it using one or more attribute columns.
This process creates a column for each combination of the attributes; the resulting table is sometimes referred to as a “wide format” or “two-way” table; let's make an example.
resources2 = resources.pivot(index="year", columns="type", values="usage per person").reset_index()
resources2
 | year | electricity | methane | water
---|---|---|---|---
0 | 2000.0 | 1130.2 | 509.0 | 92.1 |
1 | 2001.0 | 1143.9 | 500.7 | 91.3 |
2 | 2002.0 | 1195.5 | 504.2 | 90.4 |
3 | 2003.0 | 1222.8 | 480.2 | 87.3 |
4 | 2004.0 | 1228.6 | 442.4 | 80.4 |
5 | 2005.0 | 1225.0 | 434.5 | 81.3 |
6 | 2006.0 | 1219.7 | 431.3 | 82.2 |
7 | 2007.0 | 1197.0 | 381.1 | 81.6 |
8 | 2008.0 | 1203.0 | 384.9 | 84.5 |
9 | 2009.0 | 1202.9 | 389.6 | 85.8 |
10 | 2010.0 | 1200.7 | 406.2 | 83.2 |
11 | 2011.0 | 1196.1 | 377.9 | 83.1 |
As there was exactly one value for each year and each commodity, the previous example just moved values around without performing any calculation.
Suppose now we want to split a restaurant bill:
bill = pd.DataFrame([
    {"item": "pepperoni pizza", "amount": 12, "customer": "Marco", "day": "Monday"},
    {"item": "beer", "amount": 7.5, "customer": "Marco", "day": "Monday"},
    {"item": "coffee", "amount": 1.2, "customer": "Marco", "day": "Monday"},
    {"item": "pizza margherita", "amount": 10, "customer": "Luca", "day": "Monday"},
    {"item": "wine", "amount": 10, "customer": "Luca", "day": "Monday"},
    {"item": "steak", "amount": 20, "customer": "Marco", "day": "Tuesday"},
    {"item": "bottled water", "amount": 5, "customer": "Marco", "day": "Tuesday"},
])
bill
 | item | amount | customer | day
---|---|---|---|---
0 | pepperoni pizza | 12.0 | Marco | Monday |
1 | beer | 7.5 | Marco | Monday |
2 | coffee | 1.2 | Marco | Monday |
3 | pizza margherita | 10.0 | Luca | Monday |
4 | wine | 10.0 | Luca | Monday |
5 | steak | 20.0 | Marco | Tuesday |
6 | bottled water | 5.0 | Marco | Tuesday |
The pandas function pivot_table allows you to define an aggregation function to resolve collisions:
splitted_bill = pd.pivot_table(bill, index="day", values="amount", columns="customer", aggfunc="sum")
splitted_bill
day | Luca | Marco |
---|---|---|
Monday | 20.0 | 20.7 |
Tuesday | nan | 25.0 |
The pandas pd.melt() function provides a way to get back a “long format” table:
pd.melt(splitted_bill)
 | customer | value
---|---|---
0 | Luca | 20.0 |
1 | Luca | nan |
2 | Marco | 20.7 |
3 | Marco | 25.0 |
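Note that melting dropped the day index; one way to keep it (a sketch) is to reset the index first and mark day as an identifier column:

pd.melt(splitted_bill.reset_index(), id_vars="day")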