Coming back down to Earth

Photo by Dylan de Jonge on Unsplash

Our journey in the galaxy of Python analytics started with a visit to Jupyter, where we met the pandas and began our fearless exploration of data.

It is sometimes useful to leave the colorful interactive environment of Jupyter to create a more prosaic standalone Python script, which is easier to automate or to include in a larger project.

It’s time to come back to Earth: this won’t be a step back; rather, it can be an opportunity to start something bigger, like a larger team project or an application.

Create a Script from a Jupyter Notebook

Sometimes it is useful to transform your notebook into an actual script, e.g.:

  • if you want it to be executed automatically (unattended)
  • if you want to create a module out of it in order to use it in a bigger application

extract code from Jupyter

Jupyter offers a very wide range of formats to export the content of a notebook: some are graphical exports (e.g. HTML, or PDF, which requires a LaTeX installation), others are textual exports (ASCII, RTF), etc.

To start creating a Python script from your notebook, open it in Jupyter, then:

  1. from the File menu select Save and Export Notebook As
  2. from the submenu select Executable Script

This will save a file in your Download directory, named after the title of your notebook but with a .py extension.
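If you prefer the command line, the same conversion can be done with nbconvert, the tool Jupyter uses behind the scenes (the notebook name below is a placeholder):

```shell
# convert a notebook into a .py script in the same directory
# (my_notebook.ipynb is a placeholder name)
jupyter nbconvert --to script my_notebook.ipynb
```

This is handy when you want to regenerate the script automatically, e.g. from a Makefile or a CI job.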

You can now open this file with your favorite code editor; here is what you will see:

  • before every cell there will be a comment with the cell execution number
  • every code cell will be copied in an individual block of commands
  • markdown cells and raw cells will be presented as comments
  • output will be removed

In the following sections I will list some cleanup actions which will help you transform this script into a more manageable piece of code.

manage magic code

Magic code is translated into the equivalent Python call, e.g.

?sum

becomes

get_ipython().run_line_magic('pinfo', 'sum')

Most of the time you will want to get rid of this kind of code, since some functionalities (e.g. accessing documentation) are intended only for interactive usage within a Jupyter notebook.

Other functionalities (e.g. timing your cell execution) may be better managed with other libraries.
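For instance, the %timeit magic can be replaced in a script by the standard timeit module; a minimal sketch (the expression being timed is just an example):

```python
import timeit

# time a small expression 1000 times, similar to the %timeit magic
elapsed = timeit.timeit("sum(range(100))", number=1000)
print(f"1000 runs took {elapsed:.4f} s")
```

Unlike the magic, this works in any Python interpreter and the result can be logged or compared programmatically.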

add code to save tables

Jupyter conveniently shows pandas DataFrames as tables; to access these results from a script, you can save them into files.

small size

If your tables are small you may want to save them in some simple textual format:
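For example, CSV and JSON are human-readable and supported directly by pandas; a sketch with a made-up DataFrame standing in for your data:

```python
import pandas as pd

# hypothetical data standing in for your results
df = pd.DataFrame({"site": ["Fresno", "Oakland"], "pb": [0.01, 0.02]})

# plain-text formats, easy to inspect with any editor
df.to_csv("my_file.csv", index=False)
df.to_json("my_file.json", orient="records")
```

Both files can be opened in a spreadsheet or a text editor, which makes them convenient for sharing small results.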

large size

For larger tables or more complex tasks binary formats may help.

  • Apache Parquet is a columnar binary format, ideal for large data collections and high-performance computation; it requires an optional library, e.g. pyarrow (see documentation)

      df.to_parquet("my_file.pqt")
    
  • HDF5 is a binary format which can contain multiple tables in a single file (see documentation); it requires the optional tables library

      df.to_hdf("my_file.h5", key="df")

databases

Pandas has a simple mapping from tables to data frames which requires sqlalchemy and a driver for the database. The basic Python distribution includes sqlite, but many more are available as optional packages

      df.to_sql(name="my_table", con=engine)
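A minimal end-to-end sketch using the sqlite driver from the standard library: for this simple case pandas accepts a plain sqlite3 connection, so no sqlalchemy engine is needed (the file, table, and column names are made up):

```python
import sqlite3
import pandas as pd

# hypothetical data standing in for your results
df = pd.DataFrame({"site": ["Fresno", "Oakland"], "pb": [0.01, 0.02]})

# sqlite ships with Python, so no extra driver is required
con = sqlite3.connect("my_data.db")
df.to_sql(name="my_table", con=con, if_exists="replace", index=False)

# read the table back to check the round trip
back = pd.read_sql("SELECT * FROM my_table", con)
con.close()
print(len(back))
```

For other databases (PostgreSQL, MySQL, …) you would build a sqlalchemy engine and pass it as con instead.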

add code to save figures

The following example works for matplotlib and any library based on it, like seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
california = pd.read_csv("california_pb_2023.csv")

The simplest way to save an image is to

  1. store the Axes object created into some variable
  2. reach the Figure object from the .figure attribute
  3. use the .savefig() method

ax = sns.histplot(california,x="Daily Mean Pb Concentration")
ax.figure.savefig("pb_2023.png")

(figure: histogram of the daily mean Pb concentration)

A more complex sequence is required when working with multiple plots either stacked or overlapped.

In this case the plt.subplots() function creates multiple charts (axes) in a single figure

# this code is general: two charts in a single row
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Horizontally stacked subplots')

# this code is specific for this function
from scipy import stats
stats.probplot(california["Daily Mean Pb Concentration"], plot=ax1)
stats.probplot(california["Daily Mean Pb Concentration"], plot=ax2,dist=stats.distributions.lognorm(s=1))

fig.savefig("probplots.png")

(figure: the two probability plots, normal and lognormal)

clean up the code

The following suggestions hold for any Python script and are not strictly required for execution.

  • move all import statements at the beginning of the file
  • organize the code in functions and classes; possibly add type annotations
  • create a single entry point at the bottom of the code with the usual if __name__ == "__main__": main()
  • add command line options management using libraries like argparse (see here)
  • separate data and configuration from code: libraries like toml (see here) can help reading configuration files
  • transform absolute paths into relative paths
  • consider using pyproject.toml to collect dependencies and constraints
  • consider using a linter (e.g. pylint or ruff) to evaluate code inconsistencies
  • create unit tests to verify your functions individually; pytest helps in this task
  • add documentation per each function or class as well as a module doc string
  • use a code formatter to keep your style consistent (e.g. black)
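Putting a couple of these suggestions together, here is a minimal sketch of an entry point with argparse (the option name and default are hypothetical):

```python
import argparse


def main(argv=None):
    # parse command line options; argv=None means "use sys.argv"
    parser = argparse.ArgumentParser(description="example entry point")
    parser.add_argument("--input", default="data.csv", help="input file")
    args = parser.parse_args(argv)
    print(f"reading {args.input}")
    return args.input


if __name__ == "__main__":
    main()
```

Taking an optional argv parameter keeps main() testable: a unit test can call it with a list of arguments instead of patching sys.argv.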

add shell script to launch the code

I find it very convenient to have a shell script taking care of

  • setting the working directory of the process properly
  • activating any virtual environment as needed
  • fixing the environment variables

e.g.

#!/bin/bash

# change the process directory to this one
cd "$(dirname "$0")"

# activate a local virtual environment
source .venv/bin/activate

# set up some environment variables
export PYTHONPATH=$(pwd)

# launch the application
# forwards all command line arguments
python -m myapp "$@"

Exercise

Transform the notebook you created and edited in the previous section (Exploratory Data Analysis) into an executable script.

marco.p.v.vezzoli

Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions
