Data: the final frontier
Photo by Philipp Düsel on Unsplash
After heading onto Jupyter and meeting the Pandas, let’s boldly go where no one has gone before!
Here are some powerful tools to explore our data and discover new lifeforms in it
Introduction to Exploratory Data Analysis with Matplotlib and Seaborn
In this part we focus on a quick exploration of the data, choosing the tools according to the type and number of the variables involved.
For simplicity we will talk about two main kinds of data:
- categorical: a finite list of discrete values, which may or may not have a specific order, e.g. yellow, red, blue
- continuous: numerical values (most often real numbers, ℝ), usually represented with the float computer type
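As a minimal sketch of the two kinds of data, the snippet below builds a small made-up table (the column names and values are illustrative assumptions, not part of the datasets used later) and shows how pandas reports a categorical and a continuous column, including an ordered categorical.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one continuous column.
toy = pd.DataFrame({
    "colour": ["yellow", "red", "blue", "red"],
    "weight": [1.5, 2.25, 0.75, 3.0],
})

# Declare the discrete column as categorical; the numeric one stays float64.
toy["colour"] = toy["colour"].astype("category")
print(toy.dtypes)
print(toy["colour"].cat.categories.tolist())

# A categorical can also carry an explicit order.
sizes = pd.Series(pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
))
print(sizes.min(), sizes.max())
```

Without an explicit order, pandas simply sorts the category labels; the order only matters for comparisons such as min and max.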
Jupyter and pandas allow you to easily interact with the data and perform operations and visualization.
Installing basic libraries
Execute the following cell only if you need to install the seaborn library
!pip install --upgrade matplotlib seaborn
The following libraries are the foundation tools:
- pandas is an in-memory dataframe library
- matplotlib is a plotting library inspired by the MATLAB plotting API
- seaborn is a chart library based on matplotlib, with more functionality and themes
- numpy is a numeric calculation library providing fast C arrays and scientific functions
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tabulate import tabulate
Bird’s eye view of a dataset with Describe
Let’s start with a classic dataset listing the passengers of the Titanic.
The read_csv function loads a CSV file into a pandas DataFrame, which is a relation (a table of rows with named columns).
Note: the Titanic dataset was downloaded at the beginning of Part 2; if you are missing it, execute the code at the beginning of the lesson.
The .head() method returns the first rows of your data frame so you can quickly inspect it
titanic = pd.read_csv("datasets/titanic.csv")
df = titanic.head()[["Survived","Pclass","Age","Sex"]]
index | Survived | Pclass | Age | Sex |
---|---|---|---|---|
0 | 0 | 3 | 22 | male |
1 | 1 | 1 | 38 | female |
2 | 1 | 3 | 26 | female |
3 | 1 | 1 | 35 | female |
4 | 0 | 3 | 35 | male |
The .describe() method returns basic statistics for all numerical columns:
- count of non-missing elements
- mean
- standard deviation
- min
- quartiles (the 50% quartile is the median)
- max
By passing the include="all" option, i.e. .describe(include="all"), categorical columns are also summarized, with some other statistics:
- number of unique discrete values
- the most common one
- its frequency
df =titanic.describe(include="all")[["Survived","Pclass","Age","Sex"]]
statistic | Survived | Pclass | Age | Sex |
---|---|---|---|---|
count | 891 | 891 | 714 | 891 |
unique | nan | nan | nan | 2 |
top | nan | nan | nan | male |
freq | nan | nan | nan | 577 |
mean | 0.3838 | 2.309 | 29.7 | nan |
std | 0.4866 | 0.8361 | 14.53 | nan |
min | 0 | 1 | 0.42 | nan |
25% | 0 | 2 | 20.12 | nan |
50% | 0 | 3 | 28 | nan |
75% | 1 | 3 | 38 | nan |
max | 1 | 3 | 80 | nan |
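To make the difference between the two calls concrete, here is a small sketch on a made-up table (the values are invented for illustration and are not taken from the Titanic file). It shows that the default call only keeps numeric columns, that missing values are excluded from the count, and that include="all" adds the categorical statistics.

```python
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None],
    "sex": ["male", "female", "female", "female", "male"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # numeric and categorical columns

print(numeric_summary)
print(full_summary)
```

This is also why the Age column above has a count of 714 instead of 891: describe counts only the non-missing values.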
It is possible to access columns (called Series in pandas jargon) using the square bracket operator
titanic["Pclass"]
Columns whose name is a valid Python identifier (i.e. starts with a letter or an underscore and contains only letters, digits and underscores) and does not clash with an existing DataFrame attribute can also be accessed using the dot notation, e.g.
titanic.Pclass
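The following sketch, on a made-up table with hypothetical column names, illustrates when the two notations agree and when only the square brackets work.

```python
import pandas as pd

# Hypothetical columns: a valid identifier, a name with a space,
# and a name that clashes with a DataFrame method.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Ticket class": ["third", "first", "third"],
    "count": [1, 2, 3],
})

# Both notations return the same Series for a valid identifier.
same = toy["Pclass"].equals(toy.Pclass)
print(same)

# A name containing a space is not an identifier: only brackets work.
print("Ticket class".isidentifier())
print(toy["Ticket class"].tolist())

# toy.count is the DataFrame.count method, not the "count" column.
shadowed = callable(toy.count)
print(shadowed, toy["count"].tolist())
```

The square bracket form always works, so it is the safer choice in scripts; the dot form is mostly a convenience for interactive exploration.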
Each column has a data type. Since CSV files do not carry any type information, the type is inferred when loading; other (binary) data formats do include type information. The data type of a column is stored in the .dtype attribute
pclass = titanic.Pclass
print(pclass.dtype)
int64
We know this column represents the class of the ticket, so we expect it to take a small, finite number of values: we can check this with the .unique() method
df =pclass.unique()
We see this is a discrete-valued column, so we can change its type with the .astype() method
pclass = pclass.astype('category')
df = pclass.dtype
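As a small illustration of what the conversion does, the sketch below uses a short made-up Series (not the real column) and prints the distinct values, the resulting categories and the integer codes pandas stores for each row.

```python
import pandas as pd

# Hypothetical ticket classes for six passengers.
pclass = pd.Series([3, 1, 3, 1, 2, 3], name="Pclass")

distinct = pclass.unique()            # distinct values, in order of appearance
as_cat = pclass.astype("category")    # same values, categorical dtype

print(distinct)
print(as_cat.dtype)
print(as_cat.cat.categories.tolist())  # the sorted set of allowed values
print(as_cat.cat.codes.tolist())       # position of each row's value in the categories
```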
Now the statistics for Pclass are represented differently: once the converted column replaces the original one, it is summarized as a categorical column
titanic["Pclass"] = pclass
df = titanic.describe(include="all")[["Survived","Pclass","Age","Sex"]]
statistic | Survived | Pclass | Age | Sex |
---|---|---|---|---|
count | 891 | 891 | 714 | 891 |
unique | nan | 3 | nan | 2 |
top | nan | 3 | nan | male |
freq | nan | 491 | nan | 577 |
mean | 0.3838 | nan | 29.7 | nan |
std | 0.4866 | nan | 14.53 | nan |
min | 0 | nan | 0.42 | nan |
25% | 0 | nan | 20.12 | nan |
50% | 0 | nan | 28 | nan |
75% | 1 | nan | 38 | nan |
max | 1 | nan | 80 | nan |
If we know the type of a column in advance, we can pass a hint to the CSV reader
titanic = pd.read_csv(
    "datasets/titanic.csv",
    dtype={
        "Survived": "category",
        "Pclass": "category",
        "Sex": "category",
    },
)
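The effect of the dtype hint can be tried without any file on disk. The sketch below reads a few made-up CSV rows from an in-memory string (the rows are illustrative, not the real dataset) once with inferred types and once with the hint.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file.
csv_text = (
    "Survived,Pclass,Sex,Age\n"
    "0,3,male,22\n"
    "1,1,female,38\n"
    "1,3,female,26\n"
)

# Types inferred by the reader: the discrete columns come out as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Same data, with the discrete columns declared as categorical.
hinted = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Survived": "category", "Pclass": "category", "Sex": "category"},
)

print(inferred.dtypes)
print(hinted.dtypes)
```

According to the pandas documentation, categories read this way are parsed as strings, which is why the class labels appear quoted ('1', '2', '3') in the contingency table later on.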
Monovariate Categorical
Monovariate Continuous
This dataframe collects measurements of the daily mean lead (Pb) concentration in California
california = pd.read_csv("california_pb_2023.csv")
df = california.describe(include="all")[['Daily Mean Pb Concentration', 'County']]
statistic | Daily Mean Pb Concentration | County |
---|---|---|
count | 1110 | 1110 |
unique | nan | 13 |
top | nan | Los Angeles |
freq | nan | 458 |
mean | 0.00699 | nan |
std | 0.008124 | nan |
min | 0 | nan |
25% | 0.002863 | nan |
50% | 0.00444 | nan |
75% | 0.008 | nan |
max | 0.101 | nan |
sns.histplot draws a histogram
sns.histplot(california,x="Daily Mean Pb Concentration")
This distribution looks like a lognormal distribution; let’s compute the empirical cumulative distribution and plot it with a logarithmic x axis
sorted_pb = np.sort(california["Daily Mean Pb Concentration"])
prob_pb = (np.arange(len(sorted_pb)) + 1) / len(sorted_pb)
ax = sns.lineplot(x=sorted_pb, y=prob_pb)
ax.set_xscale("log", base=10)
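The cumulative distribution computed above is the empirical CDF: for each sorted value, the fraction of samples that are less than or equal to it. A minimal sketch of the same calculation as a reusable function follows; the function name and the synthetic lognormal sample are assumptions made for illustration, not the real measurements.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of samples <= it."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    cumulative = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cumulative

# Synthetic lognormal sample standing in for the concentration column.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=-5.0, sigma=0.8, size=1000)

x, p = empirical_cdf(sample)
print(x[:3], p[:3])
print(x[-1], p[-1])
```

If the data are lognormal, plotting p against x on a logarithmic x axis gives the S-shaped curve of a normal cumulative distribution.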
This looks promising, so we can check it with a quantile (probability) plot
First we try with normal quantiles; we expect the points to depart from the straight line in the tails
from scipy import stats
stats.probplot(california["Daily Mean Pb Concentration"], plot=plt)
We can compare it against a different distribution, so we choose a lognormal (here with the shape parameter s fixed to 1)
stats.probplot(california["Daily Mean Pb Concentration"], plot=plt, dist=stats.lognorm(s=1))
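Besides drawing the plot, stats.probplot also returns the correlation coefficient r of the straight-line fit, which gives a rough numeric check of how well the quantiles line up. The sketch below uses a synthetic lognormal sample (an assumption, not the real data) and compares the raw values and their logarithms against normal quantiles.

```python
import numpy as np
from scipy import stats

# Synthetic lognormal sample standing in for the concentration column.
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=-5.0, sigma=0.8, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
_, (_, _, r_raw) = stats.probplot(sample, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log(sample), dist="norm")

print("r against normal quantiles, raw values:", round(r_raw, 3))
print("r against normal quantiles, log values:", round(r_log, 3))
```

An r closer to 1 for the log-transformed values supports the lognormal hypothesis. Note that the real column has a minimum of 0, so zero values would have to be dropped before taking the logarithm.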
Multivariate Categorical
Let’s consider a group of categorical variables and explore their interaction. The pd.crosstab() function creates a contingency table, i.e. a table which counts all combinations of the considered factors
titanic['survived'] = titanic.Survived.astype('category')
titanic['sex'] = titanic.Sex.astype('category')
titanic['pclass'] = titanic.Pclass.astype('category')
ct = pd.crosstab(titanic['survived'], columns=[titanic['sex'], titanic['pclass']])
df = ct
survived | ('female', '1') | ('female', '2') | ('female', '3') | ('male', '1') | ('male', '2') | ('male', '3') |
---|---|---|---|---|---|---|
0 | 3 | 6 | 72 | 77 | 91 | 300 |
1 | 91 | 70 | 72 | 45 | 17 | 47 |
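Raw counts are hard to compare when the groups have different sizes. As a sketch on a made-up table (the six rows are invented for illustration), pd.crosstab can also normalize the counts so that each column sums to 1.

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "sex": ["male", "female", "female", "male", "male", "female"],
})

# Plain contingency table: number of rows for each combination.
counts = pd.crosstab(toy["survived"], toy["sex"])

# Same table as fractions of each column (each sex sums to 1).
shares = pd.crosstab(toy["survived"], toy["sex"], normalize="columns")

print(counts)
print(shares)
```

Applied to the table above, this kind of normalization would express survival as a rate within each sex and class combination.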
The .plot.bar() method provides a quick way to display this information as a grouped bar plot
ct.plot.bar()
Passing stacked=True stacks the bars on top of each other instead of placing them side by side
ct.plot.bar(stacked=True)
Multivariate Continuous
The iris dataset is a collection of measurements of iris flowers (sepal and petal length and width) across different varieties.
iris = pd.read_csv("iris.csv")
df = iris.head()
index | sepal_length | sepal_width | petal_length | petal_width | variety |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
1 | 4.9 | 3 | 1.4 | 0.2 | Setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
4 | 5 | 3.6 | 1.4 | 0.2 | Setosa |
Two variables
Multivariate Mixed
One continuous variable against one categorical variable
Box plots present a graphical synopsis of distributions grouped by a category:
- the middle line represents the median
- the bottom and top lines of the box represent the 25th and 75th percentiles of the distribution
- the bottom and top whiskers are usually calculated in this way:
  - calculate the interquartile range, i.e. the distance between the 25th and 75th percentiles
  - multiply the interquartile range by 1.5 and add it to the 75th percentile (or, respectively, subtract it from the 25th percentile) to obtain a limit
  - extend the whisker to the most extreme sample value which still lies within this limit
- all samples which lie beyond the whiskers are plotted as dots and may be interpreted as outliers
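The whisker rule can be sketched in a few lines of numpy. The function name, the whisker_factor parameter and the sample ages below are assumptions made for illustration; plotting libraries may differ in details such as the percentile interpolation method.

```python
import numpy as np

def box_plot_stats(values, whisker_factor=1.5):
    """Compute box edges, whisker ends and outliers with the 1.5 * IQR rule."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Limits are measured from the box edges, not from the median.
    lower_limit = q1 - whisker_factor * iqr
    upper_limit = q3 + whisker_factor * iqr
    inside = data[(data >= lower_limit) & (data <= upper_limit)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme samples within the limits.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        # Everything beyond the limits is drawn as an individual dot.
        "outliers": data[(data < lower_limit) | (data > upper_limit)],
    }

# Hypothetical ages with one unusually old passenger.
ages = [2, 20, 22, 25, 28, 30, 35, 38, 40, 80]
result = box_plot_stats(ages)
print(result)
```

In this example the upper limit is 37.25 + 1.5 * 14.5 = 59, so the upper whisker stops at 40 and the value 80 is reported as an outlier.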
sns.boxplot(titanic,x="pclass",y="Age")
Violin plots also show a smooth curve representing the continuous distribution, estimated with kernel smoothing.
This provides more visual information than a box plot, but it can be used effectively only when the number of groups is limited
sns.violinplot(titanic,x="pclass",y="Age")
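To give an idea of what kernel smoothing computes, the sketch below estimates a density with scipy.stats.gaussian_kde on synthetic ages (an assumption, not the Titanic column). It illustrates the general technique and is not necessarily the exact code path seaborn uses for its violins.

```python
import numpy as np
from scipy import stats

# Synthetic ages standing in for one group of passengers.
rng = np.random.default_rng(1)
ages = rng.normal(loc=30, scale=12, size=300)

# Gaussian kernel density estimate with the default bandwidth rule.
kde = stats.gaussian_kde(ages)

# Evaluate the smooth curve on a grid wider than the data range.
grid = np.linspace(ages.min() - 20, ages.max() + 20, 400)
density = kde(grid)

# Trapezoidal rule: a density should integrate to about 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

The outline of a violin is this kind of curve, mirrored around the category axis.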
Many continuous variables against one categorical variable
Many categorical variables against one or more continuous variables
When dealing with multiple categorical variables it is also possible to define a two-dimensional grid.
A plotting function can then be applied to the subset of data represented by each cell of the grid
g = sns.FacetGrid(titanic, col="sex", row='pclass')
g.map(sns.histplot, "Age")
Interestingly, this representation shows how the age distribution changes with the gender and the class of the passengers
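The same split that the grid shows visually can be summarized numerically with a groupby over the two categorical variables. The sketch below uses a made-up table (six invented passengers) with the same column names as above.

```python
import pandas as pd

# Hypothetical passengers.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "pclass": ["1", "1", "3", "3", "3", "1"],
    "Age": [40.0, 35.0, 22.0, 19.0, 27.0, 45.0],
})

# One row per grid cell: number of passengers and median age.
summary = toy.groupby(["pclass", "sex"])["Age"].agg(["count", "median"])
print(summary)
```

Each row of the summary corresponds to one cell of the grid, which makes it easy to spot cells with too few samples for a meaningful histogram.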