Data: the final frontier

Photo by Philipp Düsel on Unsplash

After heading into Jupyter and meeting the pandas, let’s boldly go where no one has gone before!

Here are some powerful tools to explore our data and discover the new lifeforms hiding in it.

Introduction to Exploratory Data Analysis with Matplotlib and Seaborn

In this part we are going to focus on a quick exploration of the data, depending on their type and on the number of variables considered.

For simplicity we will talk about two main data kinds:

  • categorical: a finite list of discrete values, which may or may not have a specific order, e.g. yellow, red, blue
  • continuous: numerical values (most often belonging to ℝ), usually represented with a float computer type
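As a toy illustration (a hand-made DataFrame, not one of the lesson’s datasets), the two kinds map naturally onto pandas dtypes:

```python
import pandas as pd

# "color" is categorical (finite set of discrete values),
# "temperature" is continuous (floating point numbers)
toy = pd.DataFrame({
    "color": pd.Categorical(["yellow", "red", "blue", "red"]),
    "temperature": [21.5, 19.0, 23.2, 20.1],
})
print(toy.dtypes)  # color -> category, temperature -> float64
```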

Jupyter and pandas allow you to easily interact with the data and perform operations and visualization.

Installing basic libraries

Execute the following cell only if you need to install or upgrade the matplotlib and seaborn libraries

!pip install --upgrade matplotlib seaborn

The following libraries are the foundation tools:

  • pandas is an in-memory dataframe library
  • matplotlib is a plotting library inspired by the MATLAB plotting API
  • seaborn is a chart library based on matplotlib, with more functionalities and themes
  • numpy is a numeric computation library providing fast C arrays and scientific functions

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tabulate import tabulate

Bird’s eye view of a dataset with Describe

Let’s start with a classic dataset: the passengers of the Titanic.

The read_csv function loads this format into a pandas DataFrame, which is a relation (a table made of rows and columns)

Note: the titanic dataset was downloaded at the beginning of Part 2; in case you are missing it, execute the code at the beginning of that lesson

The .head() method returns the first lines of your data frame to quickly inspect it

titanic = pd.read_csv("datasets/titanic.csv")
df = titanic.head()[["Survived", "Pclass", "Age", "Sex"]]

   Survived  Pclass  Age     Sex
0         0       3   22    male
1         1       1   38  female
2         1       3   26  female
3         1       1   35  female
4         0       3   35    male

The .describe() method returns basic statistics for all numerical columns:

  • min
  • max
  • median
  • mean
  • quartiles
  • count of elements

By using the .describe(include="all") option, categorical columns are also shown, with some additional statistics:

  • number of unique discrete values
  • the most common one
  • its frequency
df = titanic.describe(include="all")[["Survived", "Pclass", "Age", "Sex"]]

        Survived  Pclass    Age   Sex
count        891     891    714   891
unique       NaN     NaN    NaN     2
top          NaN     NaN    NaN  male
freq         NaN     NaN    NaN   577
mean      0.3838   2.309   29.7   NaN
std       0.4866  0.8361  14.53   NaN
min            0       1   0.42   NaN
25%            0       2  20.12   NaN
50%            0       3     28   NaN
75%            1       3     38   NaN
max            1       3     80   NaN

It is possible to access columns (called Series in pandas jargon) using the square bracket operator

titanic["Pclass"]

Columns whose name is a good Python identifier (i.e. starts with a letter and contains only letters, numbers and underscores) can be accessed using the dot notation, e.g.

titanic.Pclass

Each column has a data type. As CSV files do not carry any type information, it is inferred when loading; binary data formats, on the other hand, usually include a data type. The data type of a column is stored in the .dtype attribute

pclass = titanic.Pclass
print(pclass.dtype)

int64

We know this column represents the class of the ticket, so we expect it to have a finite number of actual values: we can check with the .unique() method

df = pclass.unique()

We see this is a discrete-valued column, so we can transform its type with the .astype() method

pclass = pclass.astype('category')
df = pclass.dtype

Now the statistics are represented differently for Pclass

titanic["Pclass"] = pclass
df = titanic.describe(include="all")[["Survived", "Pclass", "Age", "Sex"]]

        Survived  Pclass    Age   Sex
count        891     891    714   891
unique       NaN       3    NaN     2
top          NaN       3    NaN  male
freq         NaN     491    NaN   577
mean      0.3838     NaN   29.7   NaN
std       0.4866     NaN  14.53   NaN
min            0     NaN   0.42   NaN
25%            0     NaN  20.12   NaN
50%            0     NaN     28   NaN
75%            1     NaN     38   NaN
max            1     NaN     80   NaN

If we know the type of a column in advance, we can give a hint to the CSV reader

titanic = pd.read_csv(
    "datasets/titanic.csv",
    dtype={
        "Survived":"category",
        "Pclass":"category",
        "Sex":"category",
    }
)
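To verify that the hint was honored we can inspect the .dtypes attribute; a quick sketch using an in-memory CSV through io.StringIO (which read_csv accepts just like a file):

```python
import io
import pandas as pd

csv_text = io.StringIO("Survived,Pclass,Sex,Age\n0,3,male,22\n1,1,female,38\n")
sample = pd.read_csv(
    csv_text,
    dtype={"Survived": "category", "Pclass": "category", "Sex": "category"},
)
print(sample.dtypes)  # the hinted columns are loaded as category, Age stays numeric
```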

Monovariate Categorical

When we have a category series we can list all of the possible values using the .cat.categories attribute

print(pclass.cat.categories)

Index([1, 2, 3], dtype='int64')
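Besides .cat.categories, the .cat accessor exposes other useful attributes; for instance .cat.codes returns the integer code backing each value (shown here on a small stand-in series, not the full titanic column):

```python
import pandas as pd

pclass_demo = pd.Series([3, 1, 3, 2]).astype("category")
print(pclass_demo.cat.categories)      # Index([1, 2, 3], dtype='int64')
print(pclass_demo.cat.codes.tolist())  # positions into the categories: [2, 0, 2, 1]
```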

The sns.countplot() function shows a bar plot of categorical values

sns.countplot(x=pclass)

[bar plot of passenger counts per class]

Monovariate Continuous

This dataframe collects lead (Pb) pollutant concentrations measured in California

california = pd.read_csv("california_pb_2023.csv")
df = california.describe(include="all")[['Daily Mean Pb Concentration', 'County']]

        Daily Mean Pb Concentration       County
count                          1110         1110
unique                          NaN           13
top                             NaN  Los Angeles
freq                            NaN          458
mean                        0.00699          NaN
std                        0.008124          NaN
min                               0          NaN
25%                        0.002863          NaN
50%                         0.00444          NaN
75%                           0.008          NaN
max                           0.101          NaN

sns.histplot shows a histogram

sns.histplot(california,x="Daily Mean Pb Concentration")

[histogram of Daily Mean Pb Concentration]

This distribution looks like a lognormal distribution; let’s compute the empirical cumulative distribution and plot it with a logarithmic x axis

sorted_pb = np.sort(california["Daily Mean Pb Concentration"])
prob_pb = (np.arange(len(sorted_pb)) + 1)/len(sorted_pb)
ax = sns.lineplot(x=sorted_pb, y=prob_pb)
ax.set_xscale("log", base=10)

[empirical cumulative distribution on a logarithmic x axis]

This looks nice, so we can check the fit with a quantile (probability) plot.

First we try against a normal distribution: we expect to see heavy tails

from scipy import stats
stats.probplot(california["Daily Mean Pb Concentration"], plot=sns.mpl.pyplot)

[probability plot against a normal distribution]

We can fit it with a different distribution, choosing a lognormal

stats.probplot(california["Daily Mean Pb Concentration"], plot=sns.mpl.pyplot,dist=stats.distributions.lognorm(s=1))

[probability plot against a lognormal distribution]

This looks much better

Multivariate Categorical

Let’s consider a group of categorical variables and explore their interaction. The pd.crosstab() function provides a way to create a contingency table, i.e. a table which counts all combinations of the considered factors

titanic['survived'] = titanic.Survived.astype('category')
titanic['sex'] = titanic.Sex.astype('category')
titanic['pclass'] = titanic.Pclass.astype('category')
ct = pd.crosstab(titanic['survived'], columns=[titanic['sex'], titanic['pclass']])
df = ct

sex       female            male
pclass         1   2   3       1   2    3
survived
0              3   6  72      77  91  300
1             91  70  72      45  17   47
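pd.crosstab can also normalize the counts into rates with the normalize option; sketched here on a tiny hand-made sample (normalize="columns" makes each column sum to 1, i.e. the survival rate within each group):

```python
import pandas as pd

# toy sample: 2 of 3 females survive, 1 of 3 males does
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1],
    "sex": ["female", "female", "female", "male", "male", "male"],
})
rates = pd.crosstab(toy["survived"], toy["sex"], normalize="columns")
print(rates)  # each column sums to 1
```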

The .plot.bar() method provides a quick way to display this information as a grouped bar plot

ct.plot.bar()

[grouped bar plot of the contingency table]

ct.plot.bar(stacked=True)

[stacked bar plot of the contingency table]

Multivariate Continuous

The iris dataset is a collection of measurements of this flower’s features (sepal and petal length and width) across different varieties.

iris = pd.read_csv("iris.csv")
df = iris.head()

   sepal_length  sepal_width  petal_length  petal_width  variety
0           5.1          3.5           1.4          0.2   Setosa
1           4.9          3.0           1.4          0.2   Setosa
2           4.7          3.2           1.3          0.2   Setosa
3           4.6          3.1           1.5          0.2   Setosa
4           5.0          3.6           1.4          0.2   Setosa

Two variables

The simplest way to look at the interaction between two of these features is the scatter plot

sns.scatterplot(iris,x="sepal_length",y="sepal_width")

[scatter plot of sepal_width vs sepal_length]

Many variables

The same can be done with all the features at once, in a large symmetric matrix.

On the diagonal, histograms of the corresponding feature are plotted

sns.pairplot(iris)

[scatter plot matrix of the iris features]

Multivariate Mixed

One continuous variable against one categorical variable

Box plots present a graphical synopsis of distributions grouped by a category:

  • the middle line represents the median
  • the top and bottom lines of the box represent the 25th and 75th percentiles of the distribution
  • the top and bottom whiskers are usually calculated in this way:
    1. select the most extreme sample value
    2. calculate the interquartile range (IQR), i.e. the distance between the 25th and 75th percentiles
    3. multiply the interquartile range by 1.5 and add it to the 75th percentile (or respectively subtract it from the 25th) to get a fence
    4. between the most extreme value and the fence calculated at point 3, choose the one which is nearest to the median
  • if the fence is chosen, all samples which fall beyond it are plotted as dots and may be interpreted as outliers

sns.boxplot(titanic, x="pclass", y="Age")

[box plot of Age by pclass]
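The whisker recipe above can be sketched directly with numpy on synthetic ages (an illustration of the rule, not seaborn’s internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.normal(30, 14, 300)  # synthetic ages

q1, q3 = np.percentile(age, [25, 75])
iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # 1.5 * IQR above the 75th percentile
lower_fence = q1 - 1.5 * iqr    # 1.5 * IQR below the 25th percentile

# whiskers stop at the most extreme samples still inside the fences
upper_whisker = age[age <= upper_fence].max()
lower_whisker = age[age >= lower_fence].min()

# anything beyond the fences would be drawn as an outlier dot
outliers = age[(age > upper_fence) | (age < lower_fence)]
```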

Violin plots also show a smooth curve representing a continuous distribution, calculated with kernel smoothing.

This provides more visual information than a box plot, but may be effectively used only when the number of groups is limited

sns.violinplot(titanic,x="pclass",y="Age")

[violin plot of Age by pclass]

Many continuous variables against one categorical variable

The scatter matrix can show the groups defined by a single categorical variable using colors.

The seaborn version also shows kernel density distributions

sns.pairplot(iris,hue="variety")

[scatter plot matrix of the iris features, colored by variety]

Many categorical variables against one or more continuous variables

When dealing with multiple categorical variables, it is also possible to define a bidimensional grid.

A plotting function can then be applied to each subset represented in a given grid cell

g = sns.FacetGrid(titanic, col="sex", row='pclass')
g.map(sns.histplot, "Age")

[histograms of Age faceted by sex and pclass]

Interestingly, this representation shows the different age distributions as a function of the gender and the class of the passengers

marco.p.v.vezzoli

Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions
