Meet the Pandas
We started our space trip through the galaxy of Python Analytics by heading to Jupyter.
Now it's time to meet some of its most fascinating inhabitants: the pandas.
The code, datasets and Jupyter notebooks for the posts in this series are available in this repository.
Using Pandas (basic introduction)
Pandas is a library for working with data as relational tables.
To prepare for this lesson, execute the following cell:
!git clone https://github.com/datasciencedojo/datasets.git
Cloning into 'datasets'...
Import the pandas library and assign it a shorter alias:
import pandas as pd
Loading data
Pandas includes a rich set of input functions that allow you to load data from various file types:
function | format | notes
---|---|---
pd.read_csv | textual CSV |
pd.read_excel | binary Excel format | requires an external library
pd.read_parquet | fast binary columnar format | requires pyarrow
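As a quick sketch (the file names below are placeholders, not files from this series' repository), all readers are used the same way and return a data frame:

df_csv = pd.read_csv("data.csv")              # plain-text CSV
df_excel = pd.read_excel("data.xlsx")         # needs an external engine such as openpyxl
df_parquet = pd.read_parquet("data.parquet")  # needs pyarrow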
A data frame provides many methods to explore it, e.g. the .head() method shows the first rows:
titanic = pd.read_csv("datasets/titanic.csv")
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | nan | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | nan | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | nan | S |
Projection, Selection and Extension
A data frame is a table; you can get its column names using the .columns attribute:
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Columns can be accessed individually or in groups; this operation is called projection.
Single columns can be accessed either:
- using the square bracket operator: df["age"]
- using the dot operator (if the column name is a valid identifier): df.age
Each column is called a Series in pandas jargon.
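For example, both of the following return the Age column of the titanic data frame as a Series:

titanic["Age"].head()  # bracket operator: works for any column name
titanic.Age.head()     # dot operator: works because "Age" is a valid identifier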
Groups of columns can be accessed by passing a list of strings to the bracket operator:
titanic[["Survived","Pclass","Sex","Age"]].head()
 | Survived | Pclass | Sex | Age
---|---|---|---|---
0 | 0 | 3 | male | 22.0 |
1 | 1 | 1 | female | 38.0 |
2 | 1 | 3 | female | 26.0 |
3 | 1 | 1 | female | 35.0 |
4 | 0 | 3 | male | 35.0 |
Operations on series are vectorized, i.e. they are applied to each individual element to produce a new series.
Operations between a series and a scalar value are broadcast to all values of the series, e.g.
titanic.Pclass == 1
returns a series of booleans.
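The same holds for arithmetic, e.g. on the titanic data frame:

(titanic.Fare * 1.1).head()             # scalar operation applied to every fare
(titanic.SibSp + titanic.Parch).head()  # element-wise sum of two series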
Passing a series of booleans to the square bracket operator filters the rows satisfying the expressed condition; this operation is called selection, which is a synonym for filtering.
titanic[titanic.Pclass == 1].head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5 | A6 | S |
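Conditions can also be combined with the element-wise boolean operators & (and), | (or) and ~ (not), wrapping each condition in parentheses; e.g. to select first-class female passengers:

titanic[(titanic.Pclass == 1) & (titanic.Sex == "female")].head()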
Usually selection and projection are used at the same time; the .loc[,] operator can be conveniently used for this purpose; its arguments are:
- a boolean list for rows, or the slice operator : for no filter
- a list of column names, or the slice operator : for all columns
titanic.loc[titanic.Pclass==1,["Survived","Sex","Age"]].head()
 | Survived | Sex | Age
---|---|---|---
1 | 1 | female | 38.0 |
3 | 1 | female | 35.0 |
6 | 0 | male | 54.0 |
11 | 1 | female | 58.0 |
23 | 1 | male | 28.0 |
A table can be extended with more columns, possibly computed from other columns.
To create a new column, just assign an expression to a new column name, e.g.
df["above_average"] = (df.score > df.score.mean())
countries = pd.read_csv("datasets/WorldDBTables/CountryTable.csv")
countries.columns
Index(['code', 'name', 'continent', 'region', 'surface_area', 'independence_year', 'population', 'life_expectancy', 'gnp', 'gnp_old', 'local_name', 'government_form', 'head_of_state', 'capital', 'code2'], dtype='object')
Exercise
Calculate the population density of each country.
The countries table contains the population size in the population column and the land extension in the surface_area column.
- calculate the ratio of these two columns and store it in a new column called population_density
countries["population_density"] = countries.population / countries.surface_area countries.loc[:,["name","population_density"]].head()
 | name | population_density
---|---|---
0 | Aruba | 533.6787564766839 |
1 | Afghanistan | 34.84181631369903 |
2 | Angola | 10.32967032967033 |
3 | Anguilla | 83.33333333333333 |
4 | Albania | 118.31083901488799 |
- sort the table in descending order using the .sort_values function
- restrict the columns to only the ["name","population_density"] columns
- show the first lines of the table using the .head() method: what are the most densely populated countries?
countries.sort_values("population_density",ascending=False).loc[:,["name","population_density"]].head()
 | name | population_density
---|---|---
129 | Macao | 26277.777777777777 |
131 | Monaco | 22666.666666666668 |
92 | Hong Kong | 6308.837209302325 |
186 | Singapore | 5771.844660194175 |
79 | Gibraltar | 4166.666666666667 |
Join and concatenation
A relation may be composed of more than one table; this can provide consistency and efficiency of operations.
If two tables represent related entities, they can be joined by selecting one or more columns containing the attributes that create the relationship.
Each matching row in one table is replicated as many times as there are matching rows in the other table.
There are four kinds of available joins:
join | data included | added missing values
---|---|---
inner | only rows which belong to both tables | none
left | all rows of the first table | for all non-matching rows in the first table
right | all rows of the second table | for all non-matching rows in the second table
outer | all rows of both tables | for all non-matching rows
Suppose we have a list of courses, a list of classrooms and the classroom bookings for each course; if we want to know where each professor should hold their lessons, we need to join these tables.
course_id | title | professor |
---|---|---|
1 | quantum field theory | Bohr |
2 | thermodynamics | Carnot |
3 | statistics | Gosset |
classroom_id | building | floor |
---|---|---|
p124 | Purple | 1 |
r201 | Red | 2 |
course_id | classroom_id | weekday | start | end |
---|---|---|---|---|
1 | p124 | Monday | 9 | 11 |
1 | r201 | Wednesday | 14 | 15 |
2 | r201 | Tuesday | 14 | 17 |
3 | r201 | Monday | 14 | 15 |
3 | p124 | Tuesday | 9 | 10 |
3 | p124 | Wednesday | 9 | 10 |
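To follow along, here is a minimal sketch building the three data frames from the tables above:

courses = pd.DataFrame([
    {"course_id": 1, "title": "quantum field theory", "professor": "Bohr"},
    {"course_id": 2, "title": "thermodynamics", "professor": "Carnot"},
    {"course_id": 3, "title": "statistics", "professor": "Gosset"},
])
classrooms = pd.DataFrame([
    {"classroom_id": "p124", "building": "Purple", "floor": 1},
    {"classroom_id": "r201", "building": "Red", "floor": 2},
])
bookings = pd.DataFrame([
    {"course_id": 1, "classroom_id": "p124", "weekday": "Monday", "start": 9, "end": 11},
    {"course_id": 1, "classroom_id": "r201", "weekday": "Wednesday", "start": 14, "end": 15},
    {"course_id": 2, "classroom_id": "r201", "weekday": "Tuesday", "start": 14, "end": 17},
    {"course_id": 3, "classroom_id": "r201", "weekday": "Monday", "start": 14, "end": 15},
    {"course_id": 3, "classroom_id": "p124", "weekday": "Tuesday", "start": 9, "end": 10},
    {"course_id": 3, "classroom_id": "p124", "weekday": "Wednesday", "start": 9, "end": 10},
])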
The pd.merge() function performs the join operation, e.g.
courses_bookings = pd.merge(courses, bookings)          # joins on the shared column course_id
full_schedule = pd.merge(courses_bookings, classrooms)  # then joins on classroom_id
The default kind of join is inner; you can use the how= optional argument to choose another kind.
pd.merge will by default join on all columns with identical names; if you want to restrict the join to a given list of columns, you can use the on= option.
If the join columns have different names in the two tables, you can use the left_on= and right_on= options to match them.
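For example, with the tables above (the how and on values here are just illustrations):

# a left join keeps every course, adding missing values for courses without bookings
pd.merge(courses, bookings, how="left", on="course_id")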
Exercise
- in the countries table we have a list of countries, including their population
- in the languages table we have a list of languages spoken in each country and the percentage of the population which speaks each language
- in the countries table we have a textual code which is uniquely assigned to each country
- in the languages table we have the same code in a column called country_code

- load the languages table from datasets/WorldDBTables/LanguageTable.csv using the pd.read_csv function and store it in a variable called languages
- create a table named languages_by_country using the pd.merge function, joining the column code of table countries with the column country_code from the languages table
- calculate the number of people speaking a language by multiplying the population column with the percentage column (don't forget to divide by 100!); put the result in a column called people_speaking
- show some lines of the table keeping only the following columns: ["name","language","people_speaking","official"]; what do you see?
languages = pd.read_csv("datasets/WorldDBTables/LanguageTable.csv")
languages_by_country = pd.merge(
    countries,
    languages,
    how="inner",
    left_on=["code"],
    right_on=["country_code"]
)
languages_by_country["people_speaking"] = languages_by_country.population * \
    languages_by_country.percentage / 100
languages_by_country[["name","language","people_speaking","official"]].head()
 | name | language | people_speaking | official
---|---|---|---|---
0 | Aruba | Dutch | 5459.0 | T |
1 | Aruba | English | 9785.0 | F |
2 | Aruba | Papiamento | 79001.0 | F |
3 | Aruba | Spanish | 7622.0 | F |
4 | Afghanistan | Balochi | 204480.0 | F |
Concatenation
It may happen that your data is collected in separate data frames with the same columns and you need to create a single one from all of them.
!unzip ROMA.zip TG_SOUID100860.txt
!unzip BARI.zip TG_SOUID245914.txt
roma = pd.read_csv("TG_SOUID100860.txt", skiprows=20)
roma.head()
 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
0 | 100860 | 19510101 | 76 | 0 |
1 | 100860 | 19510102 | 108 | 0 |
2 | 100860 | 19510103 | 116 | 0 |
3 | 100860 | 19510104 | 115 | 0 |
4 | 100860 | 19510105 | 82 | 0 |
bari = pd.read_csv("TG_SOUID245914.txt", skiprows=20)
bari.head()
 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
0 | 245914 | 20211201 | -9999 | 9 |
1 | 245914 | 20211202 | -9999 | 9 |
2 | 245914 | 20211203 | -9999 | 9 |
3 | 245914 | 20211204 | -9999 | 9 |
4 | 245914 | 20211205 | -9999 | 9 |
The pd.concat() function can concatenate a list of data frames; the default behavior is consistent with the semantics of relations and it returns a single data frame:
- columns will be the union of all columns of each individual data frame in the input
- rows will keep the same order as the input data frames
temperatures = pd.concat([roma, bari])
# this will remove extra spaces from column names
temperatures.columns = list(map(str.strip, temperatures.columns))
# this will transform the column type
for col in ["SOUID", "Q_TG"]:
    temperatures[col] = temperatures[col].astype("category")
temperatures["DATE"] = pd.to_datetime(temperatures["DATE"], format="%Y%m%d")
print(temperatures.Q_TG.unique())
temperatures.loc[temperatures.Q_TG != 9, :].describe(include="all")
[0, 9, 1]
Categories (3, int64): [0, 1, 9]

 | SOUID | DATE | TG | Q_TG
---|---|---|---|---
count | 21717.0 | 21717 | 21717.0 | 21717.0
unique | 2.0 | nan | nan | 2.0
top | 100860.0 | nan | nan | 0.0
freq | 21525.0 | nan | nan | 21711.0
mean | nan | 1980-11-11 03:17:47.716535360 | 154.8837316388083 | nan
min | nan | 1951-01-01 00:00:00 | -56.0 | nan
25% | nan | 1965-11-17 00:00:00 | 101.0 | nan
50% | nan | 1980-09-28 00:00:00 | 150.0 | nan
75% | nan | 1995-08-22 00:00:00 | 212.0 | nan
max | nan | 2022-10-18 00:00:00 | 327.0 | nan
std | nan | nan | 66.53937042433274 | nan
Aggregation
Very often you may want to group your data according to one or more attributes and perform some calculation on each group; this operation is called aggregation.
E.g. suppose I want to split a restaurant bill with my friends and I have a data frame which looks like the following table:
person | item | amount |
---|---|---|
me | pepperoni pizza | 12 |
me | lager pils | 5 |
andrea | cheeseburger | 10 |
andrea | coca cola | 2 |
andrea | french fries | 2 |
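To make the snippet below runnable, here is a minimal sketch building the bill data frame from the table above:

bill = pd.DataFrame([
    {"person": "me", "item": "pepperoni pizza", "amount": 12},
    {"person": "me", "item": "lager pils", "amount": 5},
    {"person": "andrea", "item": "cheeseburger", "amount": 10},
    {"person": "andrea", "item": "coca cola", "amount": 2},
    {"person": "andrea", "item": "french fries", "amount": 2},
])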
groups = bill.groupby(["person"])
groups.agg({"amount":"sum"})
will return
person | amount |
---|---|
me | 17 |
andrea | 14 |
It is also possible to ask for multiple aggregations by using a list of functions:
g = titanic.groupby(["Pclass","Sex"])
age_summary = g.agg({"Age":["min","max","mean"]})
age_summary
(Pclass, Sex) | (Age, min) | (Age, max) | (Age, mean)
---|---|---|---
(1, female) | 2.0 | 63.0 | 34.61176470588235
(1, male) | 0.92 | 80.0 | 41.28138613861386
(2, female) | 2.0 | 57.0 | 28.722972972972972
(2, male) | 0.67 | 70.0 | 30.74070707070707
(3, female) | 0.75 | 63.0 | 21.75
(3, male) | 0.42 | 74.0 | 26.507588932806325
Please note that here the generated columns are accessible using a tuple, i.e.
age_summary[("Age","mean")]
Exercise
Using the languages_by_country table we created in the previous exercise:
- create a grouping using the "language" column
- using the .agg() method, calculate how many people speak each language
- sort the dataset from the largest group, descending
- show the first lines using the .head() method
g = languages_by_country.groupby(["language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
language | people_speaking |
---|---|
Chinese | 1190152805.0 |
Hindi | 405619174.0 |
Spanish | 307997398.0 |
Bengali | 209304719.0 |
Arabic | 205490840.7 |
Portuguese | 176981914.4 |
Japanese | 126254034.0 |
Punjabi | 104025371.0 |
English | 91616031.3 |
Javanese | 83570158.0 |
Telugu | 79065636.0 |
Marathi | 75010988.0 |
Korean | 71450757.0 |
Vietnamese | 69908416.0 |
Tamil | 68682272.0 |
French | 67947730.0 |
Urdu | 63589470.0 |
Italian | 57183654.1 |
Gujarati | 48655776.0 |
Malay | 41517994.0 |
We can also group by more than one column, e.g. by continent and language:
g = languages_by_country.groupby(["continent","language"])
languages_spoken = g.agg({"people_speaking":"sum"})
languages_spoken_sorted = languages_spoken.sort_values("people_speaking",ascending=False)
languages_spoken_sorted.head(20)
(continent, language) | people_speaking
---|---
(Asia, Chinese) | 1189353427.0
(Asia, Hindi) | 405169038.0
(Asia, Bengali) | 209304719.0
(South America, Portuguese) | 166037997.0
(South America, Spanish) | 145620868.0
(Africa, Arabic) | 134392131.0
(North America, Spanish) | 132707046.0
(Asia, Japanese) | 125573574.0
(Asia, Punjabi) | 103807342.0
(Asia, Javanese) | 83570158.0
(Asia, Telugu) | 79065636.0
(Asia, Marathi) | 75010988.0
(Asia, Korean) | 71445687.0
(Asia, Vietnamese) | 69908416.0
(Asia, Arabic) | 69184280.0
(Asia, Tamil) | 68682272.0
(Asia, Urdu) | 63589470.0
(Europe, English) | 61799068.300000004
(Europe, French) | 60455448.0
(Europe, Italian) | 55344151.1
To show the three most spoken languages per continent, we can iterate over the groups:
languages_spoken_sorted = languages_spoken_sorted.reset_index()
g = languages_spoken_sorted.groupby(["continent"])
result = []
for i, subtable in g:
    result.append(subtable.head(3).reset_index())
pd.concat(result).head()
 | index | continent | language | people_speaking
---|---|---|---|---
0 | 5 | Africa | Arabic | 134392131.0 |
1 | 32 | Africa | Hausa | 29225396.0 |
2 | 33 | Africa | Joruba | 24868874.0 |
0 | 0 | Asia | Chinese | 1189353427.0 |
1 | 1 | Asia | Hindi | 405169038.0 |
Translate the content of a table
Suppose we need to translate some foreign-language content:
resources = pd.read_csv("ds523_consumoacquaenergia.csv", sep=";")
resources.head()
 | anno | Consumo pro capite tipo | Consumo pro capite
---|---|---|---
0 | 2011 | Energia elettrica per uso domestico | 1196.1 |
1 | 2011 | Gas metano per uso domestico e riscaldamento | 377.9 |
2 | 2011 | Acqua fatturata per uso domestico | 83.1 |
3 | 2010 | Energia elettrica per uso domestico | 1200.7 |
4 | 2010 | Gas metano per uso domestico e riscaldamento | 406.2 |
The second column looks like a categorical series, so let’s check it
resources["Consumo pro capite tipo"].unique()
['Energia elettrica per uso domestico' 'Gas metano per uso domestico e riscaldamento' 'Acqua fatturata per uso domestico']
We can pass a dictionary to the .map() method like this:
translate = {
    'Energia elettrica per uso domestico': 'electricity',
    'Gas metano per uso domestico e riscaldamento': 'methane',
    'Acqua fatturata per uso domestico': 'water'
}
resources["type"] = resources["Consumo pro capite tipo"].map(translate)
Columns can also be renamed or removed:
resources = resources.rename(
    {"anno": "year", "Consumo pro capite": "usage per person"},
    axis="columns"
)
del resources["Consumo pro capite tipo"]
resources.head()
 | year | usage per person | type
---|---|---|---
0 | 2011 | 1196.1 | electricity
1 | 2011 | 377.9 | methane
2 | 2011 | 83.1 | water
3 | 2010 | 1200.7 | electricity
4 | 2010 | 406.2 | methane
Pivoting and melting
Pivot is a family of aggregation functions whose main purpose is to collect data from a relation and aggregate it using one or more attribute columns.
This process creates a column for each combination of the attributes; the resulting table is sometimes referred to as a “wide format” or “two-way” table; let's make an example.
resources2 = resources.pivot(index="year", columns="type", values="usage per person").reset_index()
resources2
 | year | electricity | methane | water
---|---|---|---|---
0 | 2000.0 | 1130.2 | 509.0 | 92.1 |
1 | 2001.0 | 1143.9 | 500.7 | 91.3 |
2 | 2002.0 | 1195.5 | 504.2 | 90.4 |
3 | 2003.0 | 1222.8 | 480.2 | 87.3 |
4 | 2004.0 | 1228.6 | 442.4 | 80.4 |
5 | 2005.0 | 1225.0 | 434.5 | 81.3 |
6 | 2006.0 | 1219.7 | 431.3 | 82.2 |
7 | 2007.0 | 1197.0 | 381.1 | 81.6 |
8 | 2008.0 | 1203.0 | 384.9 | 84.5 |
9 | 2009.0 | 1202.9 | 389.6 | 85.8 |
10 | 2010.0 | 1200.7 | 406.2 | 83.2 |
11 | 2011.0 | 1196.1 | 377.9 | 83.1 |
As there was exactly one value for each year and each commodity, the previous example just moved values around without performing any calculation.
Suppose now we want to split a restaurant bill:
bill = pd.DataFrame([
    {"item": "pepperoni pizza", "amount": 12, "customer": "Marco", "day": "Monday"},
    {"item": "beer", "amount": 7.5, "customer": "Marco", "day": "Monday"},
    {"item": "coffee", "amount": 1.2, "customer": "Marco", "day": "Monday"},
    {"item": "pizza margherita", "amount": 10, "customer": "Luca", "day": "Monday"},
    {"item": "wine", "amount": 10, "customer": "Luca", "day": "Monday"},
    {"item": "steak", "amount": 20, "customer": "Marco", "day": "Tuesday"},
    {"item": "bottled water", "amount": 5, "customer": "Marco", "day": "Tuesday"},
])
bill
 | item | amount | customer | day
---|---|---|---|---
0 | pepperoni pizza | 12.0 | Marco | Monday |
1 | beer | 7.5 | Marco | Monday |
2 | coffee | 1.2 | Marco | Monday |
3 | pizza margherita | 10.0 | Luca | Monday |
4 | wine | 10.0 | Luca | Monday |
5 | steak | 20.0 | Marco | Tuesday |
6 | bottled water | 5.0 | Marco | Tuesday |
The pandas function pivot_table allows you to define an aggregation function to resolve collisions:
splitted_bill = pd.pivot_table(bill, index="day", values="amount", columns="customer", aggfunc="sum")
splitted_bill
day | Luca | Marco |
---|---|---|
Monday | 20.0 | 20.7 |
Tuesday | nan | 25.0 |
The pandas pd.melt() function provides a way to get back a “long format” table:
pd.melt(splitted_bill)
 | customer | value
---|---|---
0 | Luca | 20.0 |
1 | Luca | nan |
2 | Marco | 20.7 |
3 | Marco | 25.0 |
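Note that melting dropped the day index; one way to keep it (a sketch) is to reset the index first and mark day as an identifier column:

pd.melt(splitted_bill.reset_index(), id_vars="day")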