Data: the final frontier

Photo by Philipp Düsel on Unsplash

After heading into Jupyter and meeting the pandas, let’s boldly go where no one has gone before!

Here are some powerful tools to explore our data and discover the new lifeforms hiding in it.

Introduction to Exploratory Data Analysis with Matplotlib and Seaborn

In this part we are going to focus on a quick exploration of the data, depending on their type and on the number of variables considered.

For simplicity we will talk about two main data kinds:

  • categorical: a finite list of discrete values, which may or may not have a specific order, e.g. yellow, red, blue
  • continuous: numerical values (most often belonging to ℝ), usually represented with a float computer type
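As a toy illustration (a hand-made DataFrame, not one of the lesson’s datasets), the two kinds map naturally onto pandas dtypes:

```python
import pandas as pd

# "color" is categorical (finite set of discrete values),
# "temperature" is continuous (floating point numbers)
toy = pd.DataFrame({
    "color": pd.Categorical(["yellow", "red", "blue", "red"]),
    "temperature": [21.5, 19.0, 23.2, 20.1],
})
print(toy.dtypes)  # color -> category, temperature -> float64
```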

Jupyter and pandas allow you to easily interact with the data and perform operations and visualization.

Installing basic libraries

Execute the following cell only if you need to install or upgrade the matplotlib and seaborn libraries

!pip install --upgrade matplotlib seaborn

The following libraries are the foundation tools:

  • pandas is an in-memory dataframe library
  • matplotlib is a plotting library inspired by the MATLAB plotting API
  • seaborn is a chart library based on matplotlib, with more functionalities and themes
  • numpy is a numeric computation library providing fast C arrays and scientific functions

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tabulate import tabulate

Bird’s eye view of a dataset with Describe

Let’s start with a classic dataset: the passengers of the Titanic.

The read_csv function loads this format into a pandas DataFrame, which is a relation (a table made of rows and columns)

Note: the titanic dataset was downloaded at the beginning of Part 2; in case you are missing it, execute the code at the beginning of that lesson

The .head() method returns the first lines of your data frame to quickly inspect it

titanic = pd.read_csv("datasets/titanic.csv")
df = titanic.head()[["Survived", "Pclass", "Age", "Sex"]]

   Survived  Pclass  Age     Sex
0         0       3   22    male
1         1       1   38  female
2         1       3   26  female
3         1       1   35  female
4         0       3   35    male

The .describe() method returns basic statistics for all numerical columns:

  • min
  • max
  • median
  • mean
  • quartiles
  • count of elements

By using the .describe(include="all") option, categorical columns are also shown, with some additional statistics:

  • number of unique discrete values
  • the most common one
  • its frequency
df = titanic.describe(include="all")[["Survived", "Pclass", "Age", "Sex"]]

        Survived  Pclass    Age   Sex
count        891     891    714   891
unique       NaN     NaN    NaN     2
top          NaN     NaN    NaN  male
freq         NaN     NaN    NaN   577
mean      0.3838   2.309   29.7   NaN
std       0.4866  0.8361  14.53   NaN
min            0       1   0.42   NaN
25%            0       2  20.12   NaN
50%            0       3     28   NaN
75%            1       3     38   NaN
max            1       3     80   NaN

It is possible to access columns (called Series in pandas jargon) using the square bracket operator

titanic["Pclass"]

Columns whose name is a good Python identifier (i.e. starts with a letter and contains only letters, numbers and underscores) can be accessed using the dot notation, e.g.

titanic.Pclass

Each column has a data type. As CSV files do not carry any type information, it is inferred when loading; binary data formats, on the other hand, usually include a data type. The data type of a column is stored in the .dtype attribute

pclass = titanic.Pclass
print(pclass.dtype)

int64

We know this column represents the class of the ticket, so we expect it to have a finite number of actual values: we can check with the .unique() method

df = pclass.unique()

We see this is a discrete-valued column, so we can transform its type with the .astype() method

pclass = pclass.astype('category')
df = pclass.dtype

Now the statistics are represented differently for Pclass

titanic["Pclass"] = pclass
df = titanic.describe(include="all")[["Survived", "Pclass", "Age", "Sex"]]

        Survived  Pclass    Age   Sex
count        891     891    714   891
unique       NaN       3    NaN     2
top          NaN       3    NaN  male
freq         NaN     491    NaN   577
mean      0.3838     NaN   29.7   NaN
std       0.4866     NaN  14.53   NaN
min            0     NaN   0.42   NaN
25%            0     NaN  20.12   NaN
50%            0     NaN     28   NaN
75%            1     NaN     38   NaN
max            1     NaN     80   NaN

If we know the type of a column in advance, we can give a hint to the CSV reader

titanic = pd.read_csv(
    "datasets/titanic.csv",
    dtype={
        "Survived":"category",
        "Pclass":"category",
        "Sex":"category",
    }
)
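To verify that the hint was honored we can inspect the .dtypes attribute; a quick sketch using an in-memory CSV through io.StringIO (which read_csv accepts just like a file):

```python
import io
import pandas as pd

csv_text = io.StringIO("Survived,Pclass,Sex,Age\n0,3,male,22\n1,1,female,38\n")
sample = pd.read_csv(
    csv_text,
    dtype={"Survived": "category", "Pclass": "category", "Sex": "category"},
)
print(sample.dtypes)  # the hinted columns are loaded as category, Age stays numeric
```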

Monovariate Categorical

When we have a category series we can list all of the possible values using the .cat.categories attribute

print(pclass.cat.categories)

Index([1, 2, 3], dtype='int64')
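Besides .cat.categories, the .cat accessor exposes other useful attributes; for instance .cat.codes returns the integer code backing each value (shown here on a small stand-in series, not the full titanic column):

```python
import pandas as pd

pclass_demo = pd.Series([3, 1, 3, 2]).astype("category")
print(pclass_demo.cat.categories)      # Index([1, 2, 3], dtype='int64')
print(pclass_demo.cat.codes.tolist())  # positions into the categories: [2, 0, 2, 1]
```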

The sns.countplot() function shows a bar plot of categorical values

sns.countplot(x=pclass)

[bar plot of passenger counts per class]

Monovariate Continuous

This dataframe collects lead (Pb) pollutant concentrations measured in California

california = pd.read_csv("california_pb_2023.csv")
df = california.describe(include="all")[['Daily Mean Pb Concentration', 'County']]

        Daily Mean Pb Concentration       County
count                          1110         1110
unique                          NaN           13
top                             NaN  Los Angeles
freq                            NaN          458
mean                        0.00699          NaN
std                        0.008124          NaN
min                               0          NaN
25%                        0.002863          NaN
50%                         0.00444          NaN
75%                           0.008          NaN
max                           0.101          NaN

sns.histplot shows a histogram

sns.histplot(california,x="Daily Mean Pb Concentration")

[histogram of Daily Mean Pb Concentration]

This distribution looks like a lognormal distribution; let’s compute the empirical cumulative distribution and plot it with a logarithmic x axis

sorted_pb = np.sort(california["Daily Mean Pb Concentration"])
prob_pb = (np.arange(len(sorted_pb)) + 1)/len(sorted_pb)
ax = sns.lineplot(x=sorted_pb, y=prob_pb)
ax.set_xscale("log", base=10)

[empirical cumulative distribution on a logarithmic x axis]

This looks nice, so we can check the fit with a quantile (probability) plot.

First we try against a normal distribution: we expect to see heavy tails

from scipy import stats
stats.probplot(california["Daily Mean Pb Concentration"], plot=sns.mpl.pyplot)

[probability plot against a normal distribution]

We can fit it with a different distribution, choosing a lognormal

stats.probplot(california["Daily Mean Pb Concentration"], plot=sns.mpl.pyplot,dist=stats.distributions.lognorm(s=1))

[probability plot against a lognormal distribution]

This looks much better

Multivariate Categorical

Let’s consider a group of categorical variables and explore their interaction. The pd.crosstab() function provides a way to create a contingency table, i.e. a table which counts all combinations of the considered factors

titanic['survived'] = titanic.Survived.astype('category')
titanic['sex'] = titanic.Sex.astype('category')
titanic['pclass'] = titanic.Pclass.astype('category')
ct = pd.crosstab(titanic['survived'], columns=[titanic['sex'], titanic['pclass']])
df = ct

sex       female            male
pclass         1   2   3       1   2    3
survived
0              3   6  72      77  91  300
1             91  70  72      45  17   47
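pd.crosstab can also normalize the counts into rates with the normalize option; sketched here on a tiny hand-made sample (normalize="columns" makes each column sum to 1, i.e. the survival rate within each group):

```python
import pandas as pd

# toy sample: 2 of 3 females survive, 1 of 3 males does
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0, 1],
    "sex": ["female", "female", "female", "male", "male", "male"],
})
rates = pd.crosstab(toy["survived"], toy["sex"], normalize="columns")
print(rates)  # each column sums to 1
```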

The .plot.bar() method provides a quick way to display this information as a grouped bar plot

ct.plot.bar()

[grouped bar plot of the contingency table]

ct.plot.bar(stacked=True)

[stacked bar plot of the contingency table]

Multivariate Continuous

The iris dataset is a collection of measurements of this flower’s features (sepal and petal length and width) across different varieties.

iris = pd.read_csv("iris.csv")
df = iris.head()

   sepal_length  sepal_width  petal_length  petal_width  variety
0           5.1          3.5           1.4          0.2   Setosa
1           4.9          3.0           1.4          0.2   Setosa
2           4.7          3.2           1.3          0.2   Setosa
3           4.6          3.1           1.5          0.2   Setosa
4           5.0          3.6           1.4          0.2   Setosa

Two variables

The simplest way to look at the interaction between two of these features is the scatter plot

sns.scatterplot(iris,x="sepal_length",y="sepal_width")

[scatter plot of sepal_width vs sepal_length]

Many variables

The same can be done with all the features at once, in a large symmetric matrix.

On the diagonal, histograms of the corresponding feature are plotted

sns.pairplot(iris)

[scatter plot matrix of the iris features]

Multivariate Mixed

One continuous variable against one categorical variable

Box plots present a graphical synopsis of distributions grouped by a category:

  • the middle line represents the median
  • the top and bottom lines of the box represent the 25th and 75th percentiles of the distribution
  • the top and bottom whiskers are usually calculated in this way:
    1. select the most extreme sample value
    2. calculate the interquartile range (IQR), i.e. the distance between the 25th and 75th percentiles
    3. multiply the interquartile range by 1.5 and add it to the 75th percentile (or respectively subtract it from the 25th) to get a fence
    4. between the most extreme value and the fence calculated at point 3, choose the one which is nearest to the median
  • if the fence is chosen, all samples which fall beyond it are plotted as dots and may be interpreted as outliers

sns.boxplot(titanic, x="pclass", y="Age")

[box plot of Age by pclass]
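The whisker recipe above can be sketched directly with numpy on synthetic ages (an illustration of the rule, not seaborn’s internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.normal(30, 14, 300)  # synthetic ages

q1, q3 = np.percentile(age, [25, 75])
iqr = q3 - q1                   # interquartile range
upper_fence = q3 + 1.5 * iqr    # 1.5 * IQR above the 75th percentile
lower_fence = q1 - 1.5 * iqr    # 1.5 * IQR below the 25th percentile

# whiskers stop at the most extreme samples still inside the fences
upper_whisker = age[age <= upper_fence].max()
lower_whisker = age[age >= lower_fence].min()

# anything beyond the fences would be drawn as an outlier dot
outliers = age[(age > upper_fence) | (age < lower_fence)]
```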

Violin plots also show a smooth curve representing a continuous distribution, calculated with kernel smoothing.

This provides more visual information than a box plot, but may be effectively used only when the number of groups is limited

sns.violinplot(titanic,x="pclass",y="Age")

[violin plot of Age by pclass]

Many continuous variables against one categorical variable

The scatter matrix can show the groups defined by a single categorical variable using colors.

The seaborn version also shows kernel density distributions

sns.pairplot(iris,hue="variety")

[scatter plot matrix of the iris features, colored by variety]

Many categorical variables against one or more continuous variables

When dealing with multiple categorical variables, it is also possible to define a bidimensional grid.

A plotting function can then be applied to each subset represented in a given grid cell

g = sns.FacetGrid(titanic, col="sex", row='pclass')
g.map(sns.histplot, "Age")

[histograms of Age faceted by sex and pclass]

Interestingly, this representation shows the different age distributions as a function of the gender and the class of the passengers

marco.p.v.vezzoli

Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions
