Hold the Line
We have made quite a journey so far! Starting from Jupyter and Pandas, we explored our datasets and created independent scripts.
It is now time to learn the basics of a very powerful tool: linear regression.
Linearity is a key concept in mathematics: it goes well beyond the naive idea of a straight line into more abstract concepts, like the independence of two effects in a dynamic system.
The Jupyter notebooks for this series of posts, the datasets, their sources and their attributions are available in this GitHub repo
Correlations and linear models
A linear model estimates a response as a linear combination of one or more inputs:

ŷ = β₀ + β₁·x₁ + … + βₚ·xₚ

Given the residuals eᵢ = yᵢ − ŷᵢ, we look for the coefficients β which best minimize all the residuals; in ordinary least squares this means minimizing the sum of their squares, Σᵢ eᵢ².
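To make this concrete before we touch real data, here is a minimal sketch on toy data (the variable names and noise level are just illustrative); np.linalg.lstsq picks the coefficients that minimize the sum of squared residuals:

import numpy as np

# toy data: y is roughly 2 + 3*x plus some noise
x = np.linspace(0, 1, 20)
y = 2 + 3 * x + np.random.normal(scale=0.1, size=x.size)

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# ordinary least squares: the coefficients minimizing the squared residuals
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [2, 3]

residuals = y - X @ beta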
import pandas as pd
This dataset collects yearly per-capita water and energy consumption estimates for Milan, published by the Italian government
Data is grouped by
- water consumption
- methane consumption
- electricity consumption
let’s first load this resource usage dataset
cons = pd.read_csv("ds523_consumoacquaenergia.csv",sep=";")
We can quickly overview its content
cons.describe(include="all")
               anno              Consumo pro capite tipo  Consumo pro capite
count      36.00000                                   36           36.000000
unique          NaN                                    3                 NaN
top             NaN  Energia elettrica per uso domestico                 NaN
freq            NaN                                   12                 NaN
mean     2005.50000                                  NaN          573.072222
std         3.50102                                  NaN          471.777743
min      2000.00000                                  NaN           80.400000
25%      2002.75000                                  NaN           89.625000
50%      2005.50000                                  NaN          432.900000
75%      2008.25000                                  NaN         1195.650000
max      2011.00000                                  NaN         1228.600000
It requires some cleanup: first let’s check the consumption type
cons["Consumo pro capite tipo"].unique()
array(['Energia elettrica per uso domestico', 'Gas metano per uso domestico e riscaldamento', 'Acqua fatturata per uso domestico'], dtype=object)
Now let’s translate these categories
translate = {
    'Energia elettrica per uso domestico': 'electricity',
    'Gas metano per uso domestico e riscaldamento': 'methan',
    'Acqua fatturata per uso domestico': 'water'
}
cons["type"] = cons["Consumo pro capite tipo"].map(translate)
Finally we can reshape the dataset to split the different kinds of resources
cons2 = cons.pivot(index="anno", columns="type", values="Consumo pro capite").reset_index()
cons2 = cons2.rename({"anno": "year"}, axis="columns")
cons2
type  year  electricity  methan  water
0     2000       1130.2   509.0   92.1
1     2001       1143.9   500.7   91.3
2     2002       1195.5   504.2   90.4
3     2003       1222.8   480.2   87.3
4     2004       1228.6   442.4   80.4
5     2005       1225.0   434.5   81.3
6     2006       1219.7   431.3   82.2
7     2007       1197.0   381.1   81.6
8     2008       1203.0   384.9   84.5
9     2009       1202.9   389.6   85.8
10    2010       1200.7   406.2   83.2
11    2011       1196.1   377.9   83.1
Now we can make use of our scatter matrix to further investigate this dataset
import seaborn as sns

sns.pairplot(cons2)
It looks like there is some variation of the methane usage over time: we can try a linear regression and see how it looks
sns.regplot(cons2,x="year",y="methan")
Covariance and correlation
The covariance matrix, defined as

Cov(X)ᵢⱼ = E[(Xᵢ − μᵢ)(Xⱼ − μⱼ)]

is the multidimensional extension of the variance, the elements of the diagonal being the variances of the corresponding dimensions; its eigenvectors define an ellipsoid representing the most important combinations of the dimensional features. This is used in Principal Component Analysis, a technique which helps to identify the most impactful features.
By dividing each element by the product of the corresponding standard deviations we obtain the correlation matrix

Corr(X)ᵢⱼ = Cov(X)ᵢⱼ / (σᵢ·σⱼ)

The elements outside the diagonal are numbers between -1 and 1; 0 represents no correlation (like a spherical cloud) while 1 and -1 represent positive and negative correlation respectively. This gives us a first estimate of the possible linear dependencies within a set of observed features.
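As a quick sketch of this normalization (on toy random data, any small numeric array would do), the correlation matrix can be obtained from the covariance matrix by dividing by the outer product of the standard deviations:

import numpy as np

data = np.random.rand(3, 50)                  # 3 features, 50 observations (features on rows)
cov = np.cov(data)                            # covariance matrix
std = np.sqrt(np.diag(cov))                   # standard deviations from the diagonal
corr = cov / np.outer(std, std)               # divide each element by sigma_i * sigma_j

print(np.allclose(corr, np.corrcoef(data)))   # True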
import numpy as np

# numpy expects a matrix where each feature is in a row instead of a column
# thus we need to transpose it
np.corrcoef(np.transpose(np.array(cons2)))
array([[ 1.        ,  0.44786015, -0.93548315, -0.65540971],
       [ 0.44786015,  1.        , -0.46029677, -0.77514369],
       [-0.93548315, -0.46029677,  1.        ,  0.75208366],
       [-0.65540971, -0.77514369,  0.75208366,  1.        ]])
We can see that the correlation between year and methan is about -0.94, which makes this pair a good candidate for a linear relationship
from scipy import stats
Regression calculation
In this simple case we have
- few observations
- only one input variable

so we may directly use the ordinary least squares (OLS) regression method to evaluate the best fit
result = stats.linregress(x=cons2.year, y=cons2.methan)
result
LinregressResult(slope=np.float64(-13.141258741258738), intercept=np.float64(26791.62773892773), rvalue=np.float64(-0.9354831530794605), pvalue=np.float64(7.894692952340763e-06), stderr=np.float64(1.5697563928623894), intercept_stderr=np.float64(3148.151109622701))
The returned object contains some interesting values; let’s check the first two:
- slope
- intercept

These allow us to write a simple prediction formula
def predict_methan(year):
    return result.slope * year + result.intercept
with this formula we can build a chart of our linear regression
import matplotlib.pyplot as plt
import seaborn as sns
# create a plot canvas
fig, ax = plt.subplots(1, 1)
# first plot the points into our canvas
sns.scatterplot(x=cons2.year, y=cons2.methan, ax=ax)
# then plot a line from the first to the last point on the same canvas
year0 = min(cons2.year)
year1 = max(cons2.year)
ax.plot((year0, year1), (predict_methan(year0), predict_methan(year1)))
Note: thanks to operator overloading, the predict_methan function also works directly on pandas Series.
Assessing the quality of a regression
residuals = cons2.methan - predict_methan(cons2.year)
Looking at the distribution of the residuals may reveal some pattern; if it does, we may assume there is a better way to represent the relation between the features under investigation.
In our example it looks like there is no apparent pattern
ax = sns.scatterplot(x=cons2.year, y=residuals)
ax.plot((year0, year1), (0, 0))
ax.set_ylabel("residuals")
The next step is to assess the variance of the residuals with respect to the total variance of the distribution of the output variable Y.
Let’s use ŷᵢ to represent the predicted values; knowing that the mean of the residuals is 0 and recalling their definition

eᵢ = yᵢ − ŷᵢ

we have the decomposition

Var(y) = Var(ŷ) + Var(e)

Now the quantity

R² = 1 − Var(e) / Var(y)

represents the fraction of the variance of the original dataset explained by the linear relation: it is a real number between 0 and 1, where 0 means no actual explanation (i.e. the mean has the same prediction power) and 1 means the relation is entirely explained.
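As a quick check of this definition (reusing the result and residuals computed above), we can compute R² from the residuals and compare it with the squared rvalue returned by linregress:

# fraction of the variance not explained by the fit
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((cons2.methan - cons2.methan.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r_squared, result.rvalue ** 2)  # the two values should match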
Multiple input parameters
In order to perform a regression with multiple inputs we are going to use the statsmodels library (see its documentation)
Execute the following cell only the first time
!pip install statsmodels
import statsmodels as sm
from statsmodels.api import formula as smf
import requests
import pandas as pd
We will use a crime dataset from UCLA
headers = "crimerat maleteen south educ police60 police59 labor males pop nonwhite unemp1 unemp2 median belowmed".split()
crime = pd.read_csv(
    "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/crime.txt",
    sep=r"\s+",
    names=headers,
    dtype=float
)
This is the description of the content of this table
Columns | meaning |
---|---|
CrimeRat | Crime rate: # of offenses reported to police per million population |
MaleTeen | The number of males of age 14-24 per 1000 population |
South | Indicator variable for Southern states (0 = No, 1 = Yes) |
Educ | Mean # of years of schooling for persons of age 25 or older |
Police60 | 1960 per capita expenditure on police by state and local government |
Police59 | 1959 per capita expenditure on police by state and local government |
Labor | Labor force participation rate per 1000 civilian urban males age 14-24 |
Males | The number of males per 1000 females |
Pop | State population size in hundred thousands |
NonWhite | The number of non-whites per 1000 population |
Unemp1 | Unemployment rate of urban males per 1000 of age 14-24 |
Unemp2 | Unemployment rate of urban males per 1000 of age 35-39 |
Median | Median value of transferable goods and assets or family income in tens of $ |
BelowMed | The number of families per 1000 earning below 1/2 the median income |
crime.head()
   crimerat  maleteen  south  educ  police60  police59  labor   males    pop  \
0      79.1     151.0    1.0   9.1      58.0      56.0  510.0   950.0   33.0
1     163.5     143.0    0.0  11.3     103.0      95.0  583.0  1012.0   13.0
2      57.8     142.0    1.0   8.9      45.0      44.0  533.0   969.0   18.0
3     196.9     136.0    0.0  12.1     149.0     141.0  577.0   994.0  157.0
4     123.4     141.0    0.0  12.1     109.0     101.0  591.0   985.0   18.0

   nonwhite  unemp1  unemp2  median  belowmed
0     301.0   108.0    41.0   394.0     261.0
1     102.0    96.0    36.0   557.0     194.0
2     219.0    94.0    33.0   318.0     250.0
3      80.0   102.0    39.0   673.0     167.0
4      30.0    91.0    20.0   578.0     174.0
The south feature is actually categorical and cannot be treated in the same way as the others, but let’s pretend it is no different for this exercise
crime.describe()
        crimerat    maleteen      south      educ    police60    police59  \
count  47.000000   47.000000  47.000000  47.00000   47.000000   47.000000
mean   90.508511  138.574468   0.340426  10.56383   85.000000   80.234043
std    38.676270   12.567634   0.478975   1.11870   29.718974   27.961319
min    34.200000  119.000000   0.000000   8.70000   45.000000   41.000000
25%    65.850000  130.000000   0.000000   9.75000   62.500000   58.500000
50%    83.100000  136.000000   0.000000  10.80000   78.000000   73.000000
75%   105.750000  146.000000   1.000000  11.45000  104.500000   97.000000
max   199.300000  177.000000   1.000000  12.20000  166.000000  157.000000

            labor        males         pop    nonwhite      unemp1     unemp2  \
count   47.000000    47.000000   47.000000   47.000000   47.000000  47.000000
mean   561.191489   983.021277   36.617021  101.127660   95.468085  33.978723
std     40.411814    29.467365   38.071188  102.828819   18.028783   8.445450
min    480.000000   934.000000    3.000000    2.000000   70.000000  20.000000
25%    530.500000   964.500000   10.000000   24.000000   80.500000  27.500000
50%    560.000000   977.000000   25.000000   76.000000   92.000000  34.000000
75%    593.000000   992.000000   41.500000  132.500000  104.000000  38.500000
max    641.000000  1071.000000  168.000000  423.000000  142.000000  58.000000

           median    belowmed
count   47.000000   47.000000
mean   525.382979  194.000000
std     96.490944   39.896061
min    288.000000  126.000000
25%    459.500000  165.500000
50%    537.000000  176.000000
75%    591.500000  227.500000
max    689.000000  276.000000
Note that there are some very skewed distributions, like nonwhite, which has a very large standard deviation with respect to its mean; its percentiles also show a long tail.
Moreover, due to their definitions, some features have very different ranges.
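To make the claim about the long tail concrete, pandas can quantify it directly (a quick check on the crime DataFrame loaded above):

# a std comparable to the mean and a positive skew indicate a long right tail
print(crime["nonwhite"].mean(), crime["nonwhite"].std())
print(crime["nonwhite"].skew())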
This may have an impact on the evaluation of the eigenvectors, as some dimensions may appear more relevant than others simply because of their scale.
For these reasons we may expect that renormalizing all distributions with respect to their standard deviation will change our findings.
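One way to see why (a small check, using the same numpy conventions as the cells below): standardizing every feature, i.e. subtracting its mean and dividing by its standard deviation, and then taking the covariance gives exactly the correlation matrix of the original data, so the eigenvectors will generally differ from those of the raw covariance matrix:

import numpy as np

X = np.transpose(np.array(crime))                # features on rows, as numpy expects
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

print(np.allclose(np.cov(Z), np.corrcoef(X)))    # True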
Evaluating correlations and covariance
import numpy as np

crime_array = np.transpose(np.array(crime))
covariance = np.cov(crime_array)
correlation = np.corrcoef(crime_array)
pd.DataFrame({"correlation": correlation[0, 1:], "features": headers[1:]})
    correlation  features
0     -0.089472  maleteen
1     -0.090637     south
2      0.322835      educ
3      0.687604  police60
4      0.666714  police59
5      0.188866     labor
6      0.213914     males
7      0.337474       pop
8      0.032599  nonwhite
9     -0.050478    unemp1
10     0.177321    unemp2
11     0.441320    median
12    -0.179024  belowmed
Apparently the most relevant correlation with the crime rate is the police expenditure, which is probably more a consequence than a cause
from numpy.linalg._linalg import EigResult

# note: np.linalg.eig does not guarantee any ordering of the eigenvalues;
# here they appear to come out from the most to the least relevant, which we rely on below
result: EigResult = np.linalg.eig(covariance)

def relevant(headers: list[str], result: EigResult, rank: int):
    """returns the features of the rank-th eigenvector sorted from the largest component descending"""
    # extract the rank-th eigenvector
    vector = result.eigenvectors[:, rank]
    # square it to get rid of the sign
    vector_sq = vector * vector
    # get the order from smallest to largest
    order = vector_sq.argsort()
    # reverse the order and return the features from the most relevant
    return [headers[int(i)] for i in reversed(order)]
Let’s grab the feature rankings for the 5 most relevant eigenvectors
for i in range(5):
    print(relevant(headers, result, i))
['nonwhite', 'median', 'belowmed', 'police60', 'police59', 'labor', 'crimerat', 'maleteen', 'males', 'pop', 'unemp1', 'educ', 'south', 'unemp2']
['nonwhite', 'median', 'crimerat', 'pop', 'police60', 'police59', 'males', 'belowmed', 'labor', 'unemp1', 'unemp2', 'maleteen', 'south', 'educ']
['labor', 'males', 'pop', 'crimerat', 'nonwhite', 'belowmed', 'maleteen', 'unemp2', 'unemp1', 'police59', 'educ', 'police60', 'median', 'south']
['pop', 'crimerat', 'median', 'belowmed', 'nonwhite', 'labor', 'police60', 'police59', 'males', 'unemp1', 'unemp2', 'maleteen', 'educ', 'south']
['labor', 'crimerat', 'pop', 'males', 'unemp1', 'unemp2', 'belowmed', 'police60', 'nonwhite', 'police59', 'median', 'maleteen', 'south', 'educ']
Performing regression from multiple inputs
In the following multiple linear regression we construct a formula listing the features which may impact the expected output
output ~ feature1 + feature2 + feature3
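As an aside, the formula mini-language can also mark an input as categorical explicitly, which would be the more rigorous way to treat the south indicator we glossed over earlier; a minimal sketch (the particular features chosen here are only illustrative):

# C(...) tells the formula parser to expand the column into category indicators
aside = smf.ols("crimerat ~ C(south) + educ + police60", crime).fit()
print(aside.params)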
I chose to use all of the features which appear as most relevant in the first eigenvector and rank ahead of our output variable (crimerat)
formula = "crimerat ~ "+ (" + ".join(relevant(headers, result, 0)[:6])) print(formula) model = smf.ols(formula,crime) regression = model.fit() regression.summary()
crimerat ~ nonwhite + median + belowmed + police60 + police59 + labor
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:               crimerat   R-squared:                       0.638
Model:                            OLS   Adj. R-squared:                  0.584
Method:                 Least Squares   F-statistic:                     11.75
Date:                Sun, 05 Jan 2025   Prob (F-statistic):           1.48e-07
Time:                        21:48:11   Log-Likelihood:                -214.10
No. Observations:                  47   AIC:                             442.2
Df Residuals:                      40   BIC:                             455.2
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -304.9695     96.968     -3.145      0.003    -500.950    -108.989
nonwhite       0.0050      0.056      0.088      0.930      -0.109       0.119
median         0.1588      0.112      1.419      0.164      -0.067       0.385
belowmed       0.6875      0.223      3.085      0.004       0.237       1.138
police60       1.3928      1.140      1.222      0.229      -0.910       3.696
police59      -0.3685      1.239     -0.297      0.768      -2.872       2.135
labor          0.1592      0.100      1.594      0.119      -0.043       0.361
==============================================================================
Omnibus:                        2.339   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.311   Jarque-Bera (JB):                1.581
Skew:                          -0.436   Prob(JB):                        0.454
Kurtosis:                       3.220   Cond. No.                     2.16e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.16e+04. This might indicate that there are strong multicollinearity or other numerical problems.
"""
The result of the fit method shown above displays a wealth of information; most notably
- some quality evaluation of the regression, e.g. the R² value
- all the estimated parameters and the intercept
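These quantities can also be read programmatically from the results object instead of parsing the printed summary (standard attributes of the statsmodels OLS results):

print(regression.rsquared)       # R² of the fit
print(regression.rsquared_adj)   # adjusted R²
print(regression.params)         # estimated coefficients, including the Intercept
print(regression.pvalues)        # p-value of each coefficient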
Exploring features
It is also important not to blindly accept the result of a regression without further analysis of the dataset.
In the following code I will check how the output variable depends on each of the features we examined; as these plots do not really show the interdependence of all the features, some of the panels may be difficult to interpret
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 6, sharey=True, figsize=(18, 3))
features = relevant(headers, result, 0)[:6]
for i in range(6):
    sns.scatterplot(x=crime[features[i]], y=crime.crimerat, ax=axs[i])
Correcting eigenvector bias with correlation matrix
By using the correlation instead of the covariance, every entry is now normalized between -1 and 1, so the scale of the features no longer matters.
As we can see, the most relevant eigenvectors change:
result2 = np.linalg.eig(correlation)
for i in range(5):
    print(relevant(headers, result2, i))
['median', 'belowmed', 'educ', 'police59', 'police60', 'south', 'maleteen', 'nonwhite', 'crimerat', 'labor', 'males', 'pop', 'unemp1', 'unemp2']
['pop', 'labor', 'unemp2', 'males', 'police60', 'police59', 'nonwhite', 'crimerat', 'south', 'educ', 'median', 'belowmed', 'unemp1', 'maleteen']
['unemp1', 'unemp2', 'labor', 'maleteen', 'crimerat', 'males', 'nonwhite', 'police59', 'police60', 'pop', 'south', 'educ', 'belowmed', 'median']
['males', 'crimerat', 'maleteen', 'labor', 'nonwhite', 'belowmed', 'unemp1', 'pop', 'south', 'unemp2', 'police60', 'police59', 'educ', 'median']
['pop', 'labor', 'belowmed', 'south', 'maleteen', 'police59', 'median', 'police60', 'unemp2', 'educ', 'unemp1', 'nonwhite', 'crimerat', 'males']
rank_no = 0
features_count = 8
formula = "crimerat ~ " + (" + ".join(relevant(headers, result2, rank_no)[:features_count]))
print(formula)
model = smf.ols(formula, crime)
regression = model.fit()
regression.summary()
crimerat ~ median + belowmed + educ + police59 + police60 + south + maleteen + nonwhite
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:               crimerat   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.673
Method:                 Least Squares   F-statistic:                     12.82
Date:                Sat, 04 Jan 2025   Prob (F-statistic):           1.02e-08
Time:                        21:27:44   Log-Likelihood:                -207.24
No. Observations:                  47   AIC:                             432.5
Df Residuals:                      38   BIC:                             449.1
Df Model:                           8
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -537.5940    108.276     -4.965      0.000    -756.786    -318.402
median         0.1764      0.101      1.740      0.090      -0.029       0.382
belowmed       0.8438      0.211      3.994      0.000       0.416       1.271
educ          14.4615      5.068      2.853      0.007       4.201      24.722
police59      -0.8715      1.099     -0.793      0.433      -3.096       1.353
police60       1.8952      1.015      1.868      0.069      -0.159       3.949
south         -1.9020     12.426     -0.153      0.879     -27.057      23.253
maleteen       0.9286      0.379      2.451      0.019       0.161       1.696
nonwhite      -0.0025      0.060     -0.041      0.967      -0.124       0.119
==============================================================================
Omnibus:                        0.285   Durbin-Watson:                   1.792
Prob(Omnibus):                  0.867   Jarque-Bera (JB):                0.010
Skew:                          -0.016   Prob(JB):                        0.995
Kurtosis:                       3.064   Cond. No.                     2.02e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.02e+04. This might indicate that there are strong multicollinearity or other numerical problems.
"""
Interestingly this regression shows a better R² (0.730 vs 0.638; the adjusted R² also improves from 0.584 to 0.673) than the previous one, demonstrating the effectiveness of using normalized distributions
rank_no = 0
features_count = 8
fig, axs = plt.subplots(1, features_count, sharey=True, figsize=(features_count * 3, 3))
features = relevant(headers, result2, 0)[:features_count]
for i in range(features_count):
    sns.scatterplot(x=crime[features[i]], y=crime.crimerat, ax=axs[i])
More visualization of the correlations
In the following examples I will show a couple of scatter plots of the most relevant features, using color for the output variable; while this visualization does not add great insight, it can nonetheless raise interesting questions about the mutual connections between the features
sns.scatterplot(x=crime.belowmed,y=crime["median"],hue=crime.crimerat)
This image shows that some of the highest crime rates appear in an area where the economic indicators seem more favorable, which just demonstrates how complex and controversial this kind of analysis may be: deciding which features to include may have important consequences.
Here is a 3D version of the same plot, adding the education feature
from mpl_toolkits.mplot3d import Axes3D

sns.set_style("whitegrid", {'axes.grid': False})
fig = plt.figure()
ax = Axes3D(fig)
fig.add_axes(ax)
x = crime.belowmed
y = crime["median"]
z = crime.educ
ax.scatter(x, y, z, c=crime.crimerat, marker='o')
ax.set_xlabel('belowmed')
ax.set_ylabel('median')
ax.set_zlabel('educ')
ax
Non-Linear features
The linearity of linear models refers to how the coefficients combine the features, not to the features themselves, so the same machinery can also be used in non-linear cases, e.g. trying to fit a polynomial model.
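As a tiny illustration of this idea (toy data, unrelated to the dataset used below): a quadratic model is still linear in its three coefficients, so an ordinary least squares routine such as numpy.polyfit can estimate them:

import numpy as np

x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x - 0.3 * x ** 2 + np.random.normal(scale=0.5, size=x.size)

# the model a + b*x + c*x^2 is linear in (a, b, c) even though it is not linear in x
coeffs = np.polyfit(x, y, deg=2)   # returned highest degree first: [c, b, a]
print(coeffs)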
# this library is used to read excel files
!pip install openpyxl
import pandas as pd
The following dataset describes financial performance metrics across many countries
financial = pd.read_excel("20220909-global-financial-development-database.xlsx",sheet_name="Data - August 2022")
Let’s first mark some attributes as categorical: we may eventually use them as filters
for col in ["iso3", "iso2", "imfn", "country", "region", "income"]:
    financial[col] = financial[col].astype("category")
In this example we will focus on a particular financial metric, di01
financial[["country","region","di01"]].describe(include="all")
            country                 region          di01
count         13268                  13268   8594.000000
unique          214                      7           NaN
top     Afghanistan  Europe & Central Asia           NaN
freq             62                   3596           NaN
mean            NaN                    NaN     37.321250
std             NaN                    NaN     34.811684
min             NaN                    NaN      0.010371
25%             NaN                    NaN     13.054380
50%             NaN                    NaN     26.018790
75%             NaN                    NaN     50.293530
max             NaN                    NaN    304.574500
import seaborn as sns
sns.scatterplot(financial,x="year",y="di01",hue="region")
Let’s first narrow it down to a single country and show its dependency on time
fin_italy = financial.loc[financial["country"] == "Italy", :]
sns.scatterplot(fin_italy, x="year", y="di01")
This shows some kind of growing trend: let’s first try a simple linear regression with respect to the year
sns.regplot(fin_italy, x="year", y="di01")
from scipy.stats import linregress
fin_italy[["year","di01"]].describe()
              year       di01
count    62.000000  57.000000
mean   1990.500000  65.643938
std      18.041619  13.908666
min    1960.000000  46.931830
25%    1975.250000  54.484480
50%    1990.500000  62.710020
75%    2005.750000  75.462590
max    2021.000000  93.921490
We see this dataset does not contain the metric for all years, so let’s remove the rows without values
fin_italy_valued = fin_italy.loc[~fin_italy.di01.isna(),["year","di01"]]
Here we see the results of the regression
result = linregress(y=fin_italy_valued.di01, x=fin_italy_valued.year)
result
LinregressResult(slope=np.float64(0.4768662371619365), intercept=np.float64(-884.1481148904821), rvalue=np.float64(0.5972453694029883), pvalue=np.float64(9.37869396904616e-07), stderr=np.float64(0.08635123015454191), intercept_stderr=np.float64(171.99538887670158))
rsquare = result.rvalue ** 2
rsquare
np.float64(0.356702031273312)
Let’s plot the residuals to see whether there is any clear pattern
residuals = fin_italy_valued.di01 - (fin_italy_valued.year * result.slope + result.intercept)
sns.scatterplot(x=fin_italy_valued.year, y=residuals)
Adding nonlinear features
For simplicity of the fit we will use a column with the years expressed as a difference from the first one.
In this case the residuals suggest a kind of oscillatory behavior, but modeling that is way too complex for this tutorial, as it involves estimating the periods of the oscillations and their phase shifts.
A simpler way to improve the fit is to use a higher degree polynomial.
import statsmodels as sm
from statsmodels.api import formula as smf
import requests
import pandas as pd
Let’s create the nonlinear feature columns for a polynomial of degree 3. The higher the degree, the lower the error; but choosing an excessively large degree can lead to overfitting without adding much more insight
fin_italy_valued["dy"] = fin_italy_valued.year - min(fin_italy_valued.year) fin_italy_valued["dy2"] = fin_italy_valued.dy ** 2 fin_italy_valued["dy3"] = fin_italy_valued.dy ** 3
now we can fit and get the coefficients for these features
model = smf.ols("di01 ~ dy + dy2 + dy3",fin_italy_valued) regression = model.fit() regression.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                   di01   R-squared:                       0.598
Model:                            OLS   Adj. R-squared:                  0.575
Method:                 Least Squares   F-statistic:                     26.30
Date:                Sun, 19 Jan 2025   Prob (F-statistic):           1.48e-10
Time:                        19:48:07   Log-Likelihood:                -204.44
No. Observations:                  57   AIC:                             416.9
Df Residuals:                      53   BIC:                             425.1
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     69.6046      4.436     15.690      0.000      60.707      78.503
dy            -1.8193      0.670     -2.715      0.009      -3.163      -0.475
dy2            0.0614      0.027      2.262      0.028       0.007       0.116
dy3           -0.0004      0.000     -1.350      0.183      -0.001       0.000
==============================================================================
Omnibus:                        6.613   Durbin-Watson:                   0.134
Prob(Omnibus):                  0.037   Jarque-Bera (JB):                3.703
Skew:                           0.419   Prob(JB):                        0.157
Kurtosis:                       2.075   Cond. No.                     2.84e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.84e+05. This might indicate that there are strong multicollinearity or other numerical problems.
"""
The R² value improved from about 0.36 to 0.60
Here are our coefficients: the third degree term contributes very little
regression.params
Intercept    69.604622
dy           -1.819269
dy2           0.061431
dy3          -0.000417
dtype: float64
With them we can now plot the polynomial and verify the new fit
fin_italy_valued["predicted"] = regression.params[0] + \ regression.params[1] * fin_italy_valued.dy + \ regression.params[2] * fin_italy_valued.dy2 + \ regression.params[3] * fin_italy_valued.dy3
fig, ax = plt.subplots(1, 1)
sns.scatterplot(data=fin_italy_valued, x="year", y="di01", ax=ax)
sns.lineplot(data=fin_italy_valued, x="year", y="predicted", ax=ax)
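As a side note, the same predicted values can be obtained directly from the fitted results object, without writing the polynomial out by hand; a small sketch reusing the regression fitted above:

import numpy as np

# statsmodels re-applies the formula to the provided DataFrame
fin_italy_valued["predicted2"] = regression.predict(fin_italy_valued)
print(np.allclose(fin_italy_valued.predicted, fin_italy_valued.predicted2))  # True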