{"id":572,"date":"2025-02-16T16:48:38","date_gmt":"2025-02-16T15:48:38","guid":{"rendered":"https:\/\/noiseonthenet.space\/noise\/?p=572"},"modified":"2025-02-16T16:48:40","modified_gmt":"2025-02-16T15:48:40","slug":"hold-the-line","status":"publish","type":"post","link":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/","title":{"rendered":"Hold the Line"},"content":{"rendered":"<p> <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?ssl=1\" alt=\"sam-goodgame-Pe5BC-EDtB4-unsplash.jpg\" \/> Photo by <a href=\"https:\/\/unsplash.com\/@sgoodgame?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash\">Sam Goodgame<\/a> on <a href=\"https:\/\/unsplash.com\/photos\/san-francisco-bridge-Pe5BC-EDtB4?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash\">Unsplash<\/a> <br> <\/p>\n\n<p> We made quite a journey so far! Starting from <a href=\"https:\/\/noiseonthenet.space\/noise\/2025\/01\/a-trip-to-jupyter-lab\/\">Jupyter<\/a> and <a href=\"https:\/\/noiseonthenet.space\/noise\/2025\/01\/meet-the-pandas\/\">Pandas<\/a> we <a href=\"https:\/\/noiseonthenet.space\/noise\/2025\/02\/data-the-final-frontier\/\">explored our datasets<\/a> and <a href=\"https:\/\/noiseonthenet.space\/noise\/2025\/02\/coming-back-down-to-earth\/\">created independent scripts<\/a>. <br> <\/p>\n\n<p> It is now the time to learn the basics of a very powerful tool: Linear Regression. <br> <\/p>\n\n<p> Linearity is a key concept in mathematics: it goes very far from the naive idea of a straight line into more abstract concepts like the independence of two effecs into a dynamic system. 
<br> <\/p>\n\n<p> The jupyter notebooks for this series of posts, the datasets, their source and their attribution are available in this <a href=\"https:\/\/github.com\/noiseOnTheNet\/python-post023_jupyter_analitics\">Github Repo<\/a> <br> <\/p>\n\n<p> <a id=\"orgfcb9f5a\"><\/a> <\/p>\n<div id=\"outline-container-correlations-and-linear-models\" class=\"outline-2\">\n<h2 id=\"correlations-and-linear-models\">Correlations and linear models<\/h2>\n<div class=\"outline-text-2\" id=\"text-correlations-and-linear-models\">\n<p> A linear model estimate a response from the linear combination of one or more inputs <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y+%5Capprox+x_1+%5Calpha_1+%2B+y_2+%5Calpha_2+%2B+...+%2B+y_n+%5Calpha_n+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y &#92;approx x_1 &#92;alpha_1 + y_2 &#92;alpha_2 + ... + y_n &#92;alpha_n \" class=\"latex\" \/> <\/p> \n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=y+%5Capprox+%5Cvec%7Bx%7D+%5Ccdot+%5Cvec%7B%5Calpha%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"y &#92;approx &#92;vec{x} &#92;cdot &#92;vec{&#92;alpha} \" class=\"latex\" \/> <\/p> \n\n<p> <a id=\"orgd2a29ba\"><\/a> given an estimation <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7By%7D+%3D+%5Cvec%7Bx%7D+%5Ccdot+%5Cvec%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{y} = &#92;vec{x} &#92;cdot &#92;vec{&#92;alpha}\" class=\"latex\" \/> <\/p>\n\n<p> we look for the best <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cvec%7B%5Calpha%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;vec{&#92;alpha}\" class=\"latex\" \/> which minimize all residuals <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon%28%5Chat%7By%7D%29+%3D+y+-+%5Chat%7By%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon(&#92;hat{y}) = y - &#92;hat{y} \" class=\"latex\" \/> <\/p> \n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> pandas <span style=\"color: #cba6f7;\">as<\/span> pd\n<\/pre>\n<\/div>\n\n<p> <a id=\"org0ac9d2e\"><\/a> This dataset collects the yearly water and energy consuption estimation per capita in Milan collected by the italian government <\/p>\n\n<p> Data is grouped by <\/p>\n\n<ul class=\"org-ul\">\n<li>water consumption<\/li>\n<li>methan consumption<\/li>\n<li>electricity consumption<\/li>\n<\/ul>\n\n<p> let&rsquo;s first load this resource usage dataset <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">cons<\/span> <span style=\"color: #89dceb;\">=<\/span> pd.read_csv(<span style=\"color: #a6e3a1;\">\"ds523_consumoacquaenergia.csv\"<\/span>,sep<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\";\"<\/span>)\n<\/pre>\n<\/div>\n\n<p> We can quickly overview its content <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">cons.describe(include<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"all\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n              anno              Consumo pro capite tipo  Consumo pro capite\ncount     36.00000                                   36           36.000000\nunique         NaN                                    3                 NaN\ntop            NaN  Energia elettrica per uso domestico                 NaN\nfreq           NaN                                   12                 
NaN\nmean    2005.50000                                  NaN          573.072222\nstd        3.50102                                  NaN          471.777743\nmin     2000.00000                                  NaN           80.400000\n25%     2002.75000                                  NaN           89.625000\n50%     2005.50000                                  NaN          432.900000\n75%     2008.25000                                  NaN         1195.650000\nmax     2011.00000                                  NaN         1228.600000\n<\/pre>\n\n\n<p> It requires some cleanup: first let&rsquo;s check the consumption type <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">cons[<span style=\"color: #a6e3a1;\">\"Consumo pro capite tipo\"<\/span>].unique()\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\narray(['Energia elettrica per uso domestico',\n       'Gas metano per uso domestico e riscaldamento',\n       'Acqua fatturata per uso domestico'], dtype=object)\n<\/pre>\n\n\n<p> Now let&rsquo;s translate these categories <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">translate<\/span> <span style=\"color: #89dceb;\">=<\/span> {\n    <span style=\"color: #a6e3a1;\">'Energia elettrica per uso domestico'<\/span>:<span style=\"color: #a6e3a1;\">'electricity'<\/span>,\n    <span style=\"color: #a6e3a1;\">'Gas metano per uso domestico e riscaldamento'<\/span>:<span style=\"color: #a6e3a1;\">'methan'<\/span>,\n    <span style=\"color: #a6e3a1;\">'Acqua fatturata per uso domestico'<\/span>:<span style=\"color: #a6e3a1;\">'water'<\/span>\n}\n<span style=\"color: #cdd6f4;\">cons<\/span>[<span style=\"color: #a6e3a1;\">\"type\"<\/span>] <span style=\"color: #89dceb;\">=<\/span> cons[<span style=\"color: #a6e3a1;\">\"Consumo pro capite tipo\"<\/span>].<span style=\"color: 
#f38ba8;\">map<\/span>(translate)\n<\/pre>\n<\/div>\n\n<p> Finally we can reshape the dataset to split the different kind of resources <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">cons2<\/span> <span style=\"color: #89dceb;\">=<\/span> cons.pivot(index<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"anno\"<\/span>,columns<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"type\"<\/span>,values<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"Consumo pro capite\"<\/span>).reset_index()\n<span style=\"color: #cdd6f4;\">cons2<\/span> <span style=\"color: #89dceb;\">=<\/span> cons2.rename({<span style=\"color: #a6e3a1;\">\"anno\"<\/span>:<span style=\"color: #a6e3a1;\">\"year\"<\/span>}, axis<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"columns\"<\/span>)\ncons2\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\ntype  year  electricity  methan  water\n0     2000       1130.2   509.0   92.1\n1     2001       1143.9   500.7   91.3\n2     2002       1195.5   504.2   90.4\n3     2003       1222.8   480.2   87.3\n4     2004       1228.6   442.4   80.4\n5     2005       1225.0   434.5   81.3\n6     2006       1219.7   431.3   82.2\n7     2007       1197.0   381.1   81.6\n8     2008       1203.0   384.9   84.5\n9     2009       1202.9   389.6   85.8\n10    2010       1200.7   406.2   83.2\n11    2011       1196.1   377.9   83.1\n<\/pre>\n\n\n<p> Now we can make use of our scatter matrix to further investigate this dataset <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> seaborn <span style=\"color: #cba6f7;\">as<\/span> sns\nsns.pairplot(cons2)\n<\/pre>\n<\/div>\n\n<div id=\"orgf03ccfe\" class=\"figure\"> <p><img 
data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/b29f0e25b66fbc630eccdb1dbe9e0e331d1f4cb8.png?ssl=1\" alt=\"b29f0e25b66fbc630eccdb1dbe9e0e331d1f4cb8.png\" \/> <\/p> <\/div>\n\n<p> Looks like there is some kind of variation of the methan usage in time: we can try to make a linear regression and see how it does look like <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">sns.regplot(cons2,x<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"methan\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='year', ylabel='methan'&gt;\n<\/pre>\n\n\n<div id=\"orgc31258b\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/27bb1e3606cc5a352a90cee654ce719aa4ad5982.png?ssl=1\" alt=\"27bb1e3606cc5a352a90cee654ce719aa4ad5982.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org642f803\"><\/a> <\/p>\n<\/div>\n<div id=\"outline-container-covariance-and-correlation\" class=\"outline-3\">\n<h3 id=\"covariance-and-correlation\">Covariance and correlation<\/h3>\n<div class=\"outline-text-3\" id=\"text-covariance-and-correlation\">\n<p> The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Covariance_matrix\">covariance matrix<\/a>, defined as <\/p>\n\n<p> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=cov%5BX_i%2C+X_j%5D+%3D+E%5B%28X_i+-+E%5BX_i%5D%29%28X_j+-+E%5BX_j%5D%29%5D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"cov[X_i, X_j] = E[(X_i - E[X_i])(X_j - E[X_j])]\" class=\"latex\" \/> <\/p>\n\n<p> is the multidimensional extension of the variance, elements of the diagonal being the variance of the corrisponding dimension; its eigenvectors define an ellipsoid representing 
the most important combinations of the dimensional features; this is used in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis\">Principal Compaonent Analysis<\/a>, a technique which helps to define the most impactful features. <\/p>\n\n<p> By dividing each element with the product of the standard deviations we have the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Correlation\">correlation matrix<\/a> <\/p>\n\n<p> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=corr%5BX_i%2C+X_j%5D+%3D+%5Cfrac%7BE%5B%28X_i+-+E%5BX_i%5D%29%28X_j+-+E%5BX_j%5D%29%5D%7D%7B%5Csigma_i%5Csigma_j%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"corr[X_i, X_j] = &#92;frac{E[(X_i - E[X_i])(X_j - E[X_j])]}{&#92;sigma_i&#92;sigma_j}\" class=\"latex\" \/> <\/p>\n\n<p> The elements outside the diagonal are numbers between -1 and 1; 0 represents no correlation (like a spherical cloud) while 1 and -1 represent positive and negative correlation respectively; this gives us a first estimation of the possible linear dependecies within a set of observation features; <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> numpy <span style=\"color: #cba6f7;\">as<\/span> np\n<span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">numpy expects a matrix where each feature is in a row instead of a column<\/span>\n<span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">thus we need to transpose it<\/span>\nnp.corrcoef(np.transpose(np.array(cons2)))\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\narray([[ 1.        ,  0.44786015, -0.93548315, -0.65540971],\n       [ 0.44786015,  1.        , -0.46029677, -0.77514369],\n       [-0.93548315, -0.46029677,  1.        ,  0.75208366],\n       [-0.65540971, -0.77514369,  0.75208366,  1.        
]])\n<\/pre>\n\n\n<p> <a id=\"org3aeadca\"><\/a> we can see that the negative correlation between year and methan is about -0.9 which makes it a good candidate for a linear correlation <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">from<\/span> scipy <span style=\"color: #cba6f7;\">import<\/span> stats\n<\/pre>\n<\/div>\n\n<p> <a id=\"org6c9117a\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-regression-calculation\" class=\"outline-3\">\n<h3 id=\"regression-calculation\">Regression calculation<\/h3>\n<div class=\"outline-text-3\" id=\"text-regression-calculation\">\n<p> in this simple case we have <\/p>\n\n<ul class=\"org-ul\">\n<li>few observations<\/li>\n<li>only one input value so we may directly use the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Ordinary_least_squares\">Ordinary Least Squares regression method<\/a> to evaluate the best fit<\/li>\n<\/ul>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">result<\/span> <span style=\"color: #89dceb;\">=<\/span> stats.linregress(x<span style=\"color: #89dceb;\">=<\/span>cons2.year, y<span style=\"color: #89dceb;\">=<\/span>cons2.methan)\nresult\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\nLinregressResult(slope=np.float64(-13.141258741258738), intercept=np.float64(26791.62773892773), rvalue=np.float64(-0.9354831530794605), pvalue=np.float64(7.894692952340763e-06), stderr=np.float64(1.5697563928623894), intercept_stderr=np.float64(3148.151109622701))\n<\/pre>\n\n\n<p> <a id=\"orgc061f4d\"><\/a> the returned object contains some interesting values; let&rsquo;s check the first two: <\/p>\n\n<ul class=\"org-ul\">\n<li>slope<\/li>\n<li>intercept<\/li>\n<\/ul>\n\n<p> allows us to write a simple prediction formula <\/p>\n\n<div class=\"org-src-container\">\n<label 
class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">def<\/span> <span style=\"color: #89b4fa;\">predict_methan<\/span>(year):\n    <span style=\"color: #cba6f7;\">return<\/span> result.slope <span style=\"color: #89dceb;\">*<\/span> year <span style=\"color: #89dceb;\">+<\/span> result.intercept\n<\/pre>\n<\/div>\n\n<p> <a id=\"orgca521cb\"><\/a> with this formula we can build a chart of our linear regression <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> matplotlib.pyplot <span style=\"color: #cba6f7;\">as<\/span> plt\n<span style=\"color: #cba6f7;\">import<\/span> seaborn <span style=\"color: #cba6f7;\">as<\/span> sns\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">create a plot canvas<\/span>\n<span style=\"color: #cdd6f4;\">fig<\/span>, <span style=\"color: #cdd6f4;\">ax<\/span> <span style=\"color: #89dceb;\">=<\/span> plt.subplots(<span style=\"color: #fab387;\">1<\/span>,<span style=\"color: #fab387;\">1<\/span>)\n\n<span style=\"color: #6c7086;\">#<\/span><span style=\"color: #6c7086;\">first plot the points into our canvas<\/span>\nsns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>cons2.year, y<span style=\"color: #89dceb;\">=<\/span>cons2.methan, ax<span style=\"color: #89dceb;\">=<\/span>ax)\n\n<span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">then plot a line from the first to the last point on the same canvas<\/span>\n<span style=\"color: #cdd6f4;\">year0<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #f38ba8;\">min<\/span>(cons2.year)\n<span style=\"color: #cdd6f4;\">year1<\/span> <span style=\"color: #89dceb;\">=<\/span> <span 
style=\"color: #f38ba8;\">max<\/span>(cons2.year)\nax.plot((year0,year1),(predict_methan(year0),predict_methan(year1)))\n<\/pre>\n<\/div>\n\n<div id=\"orgfe8c128\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/17438532d1c01292d94bc1d9411c8245ebaacac2.png?ssl=1\" alt=\"17438532d1c01292d94bc1d9411c8245ebaacac2.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org1518742\"><\/a> note: the polymorphism allows to properly use the prodict_methan function also with pandas Series <\/p>\n\n<p> <a id=\"org1cdf558\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-assessing-the-qaulity-of-a-regression\" class=\"outline-3\">\n<h3 id=\"assessing-the-qaulity-of-a-regression\">Assessing the quality of a regression<\/h3>\n<div class=\"outline-text-3\" id=\"text-assessing-the-qaulity-of-a-regression\">\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">residuals<\/span> <span style=\"color: #89dceb;\">=<\/span> cons2.methan <span style=\"color: #89dceb;\">-<\/span> predict_methan(cons2.year)\n<\/pre>\n<\/div>\n\n<p> <a id=\"orgd5987e4\"><\/a> looking at residuals distribution may show some pattern; in this case we may assume there is a better way to represent the relation between the features under investigation. 
<\/p>\n\n<p> In our example looks like there is no apparent pattern <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">ax<\/span> <span style=\"color: #89dceb;\">=<\/span> sns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>cons2.year, y<span style=\"color: #89dceb;\">=<\/span>residuals)\nax.plot((year0,year1),(<span style=\"color: #fab387;\">0<\/span>,<span style=\"color: #fab387;\">0<\/span>))\nax.set_ylabel(<span style=\"color: #a6e3a1;\">\"residuals\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\nText(0, 0.5, 'residuals')\n<\/pre>\n\n\n<div id=\"orgae0f136\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/2c93d21ae555706ae42752463532625d02c13d58.png?ssl=1\" alt=\"2c93d21ae555706ae42752463532625d02c13d58.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org6011417\"><\/a> The next step would be to assess the variance of residuals respect to the total variance of the distribution of the output variable Y: <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7Bvar%5B%5Cepsilon%5D%7D%7Bvar%5BY%5D%7D+%3D+%5Cfrac%7BE%5B%28%5Cepsilon+-+E%5B%5Cepsilon%5D%29%5E2%5D%7D%7BE%5B%28Y+-+E%5BY%5D%29%5E2%5D%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{var[&#92;epsilon]}{var[Y]} = &#92;frac{E[(&#92;epsilon - E[&#92;epsilon])^2]}{E[(Y - E[Y])^2]} \" class=\"latex\" \/> <\/p> \n\n<p> let&rsquo;s use <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Chat%7BY%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;hat{Y}\" class=\"latex\" \/> to represent the predicted values; by knowing that the mean of the residuals is 0 and their definition <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=E%5B%5Cepsilon%5D+%3D+0+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"E[&#92;epsilon] = 0 \" class=\"latex\" \/> <\/p> \n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cepsilon+%3D+Y+-+%5Chat%7BY%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;epsilon = Y - &#92;hat{Y} \" class=\"latex\" \/> <\/p> \n\n<p> we have <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cfrac%7Bvar%5B%5Cepsilon%5D%7D%7Bvar%5BY%5D%7D+%3D+%5Cfrac%7BE%5B%28Y+-+%5Chat%7BY%7D%29%5E2%5D%7D%7BE%5B%28Y+-+E%5BY%5D%29%5E2%5D%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;frac{var[&#92;epsilon]}{var[Y]} = &#92;frac{E[(Y - &#92;hat{Y})^2]}{E[(Y - E[Y])^2]} \" class=\"latex\" \/> <\/p> \n\n<p> now the quantity <\/p>\n\n<p style=\"text-align:center\"> <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2+%3D+1+-+%5Cfrac%7BE%5B%28Y+-+%5Chat%7BY%7D%29%5E2%5D%7D%7BE%5B%28Y+-+E%5BY%5D%29%5E2%5D%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R^2 = 1 - &#92;frac{E[(Y - &#92;hat{Y})^2]}{E[(Y - E[Y])^2]} \" class=\"latex\" \/> <\/p> \n\n<p> represent the fraction of the variance of the original dataset explained by the linear relation: this is a real number between 0 and 1 where 0 represents no actual explaination (i.e. 
the mean has the same prediction power) to 1 representing all the relation is explained <\/p>\n\n<p> <a id=\"org451d1ef\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-multiple-input-parameters\" class=\"outline-3\">\n<h3 id=\"multiple-input-parameters\">Multiple input parameters<\/h3>\n<div class=\"outline-text-3\" id=\"text-multiple-input-parameters\">\n<p> in order to perform this regression with multiple inputs we are going to use the <code>statmodels<\/code> library (see <a href=\"https:\/\/www.statsmodels.org\/stable\/index.html\">documentation<\/a>) <\/p>\n\n<p> Execute the following cell only the first time <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">!pip install statsmodels\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> statsmodels <span style=\"color: #cba6f7;\">as<\/span> sm\n<span style=\"color: #cba6f7;\">from<\/span> statsmodels.api <span style=\"color: #cba6f7;\">import<\/span> formula <span style=\"color: #cba6f7;\">as<\/span> smf\n<span style=\"color: #cba6f7;\">import<\/span> requests\n<span style=\"color: #cba6f7;\">import<\/span> pandas <span style=\"color: #cba6f7;\">as<\/span> pd\n<\/pre>\n<\/div>\n\n<p> <a id=\"org3a59db5\"><\/a> We will use a crime dataset from UCLA <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">headers<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #a6e3a1;\">\"crimerat maleteen south educ police60 police59 labor  males pop nonwhite unemp1  unemp2 median belowmed\"<\/span>.split()\n<span style=\"color: #cdd6f4;\">crime<\/span> <span style=\"color: #89dceb;\">=<\/span> pd.read_csv(\n    <span style=\"color: 
#a6e3a1;\">\"https:\/\/stats.idre.ucla.edu\/wp-content\/uploads\/2016\/02\/crime.txt\"<\/span>,\n    sep<span style=\"color: #89dceb;\">=<\/span>r<span style=\"color: #a6e3a1;\">\"\\s+\"<\/span>,\n    names<span style=\"color: #89dceb;\">=<\/span>headers,\n    dtype<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #f38ba8;\">float<\/span>\n)\n<\/pre>\n<\/div>\n\n<p> <a id=\"org478aaa7\"><\/a> This is the description of the content of this table <\/p>\n\n<table border=\"2\" cellspacing=\"0\" cellpadding=\"6\" rules=\"groups\" frame=\"hsides\">\n\n\n<colgroup>\n<col  class=\"org-left\" \/>\n\n<col  class=\"org-left\" \/>\n<\/colgroup>\n<thead>\n<tr>\n<th scope=\"col\" class=\"org-left\">Columns<\/th>\n<th scope=\"col\" class=\"org-left\">meaning<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">CrimeRat<\/td>\n<td class=\"org-left\">Crime rate: # of offenses reported to police per million population<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">MaleTeen<\/td>\n<td class=\"org-left\">The number of males of age 14-24 per 1000 population<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">South<\/td>\n<td class=\"org-left\">Indicator variable for Southern states (0 = No, 1 = Yes)<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Educ<\/td>\n<td class=\"org-left\">Mean # of years of schooling for rpersons of age 25 or older<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Police60<\/td>\n<td class=\"org-left\">1960 per capita expenditure on police by state and local government<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Police59<\/td>\n<td class=\"org-left\">1959 per capita expenditure on police by state and local government<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Labor<\/td>\n<td class=\"org-left\">Labor force participation rate per 1000 civilian urban males age 14-24<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Males<\/td>\n<td class=\"org-left\">The number of males per 1000 females<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Pop<\/td>\n<td 
class=\"org-left\">State population size in hundred thousands<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">NonWhite<\/td>\n<td class=\"org-left\">The number of non-whites per 1000 population<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Unemp1<\/td>\n<td class=\"org-left\">Unemployment rate of urban males per 1000 of age 14-24<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Unemp2<\/td>\n<td class=\"org-left\">Unemployment rate of urban males per 1000 of age 35-39<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">Median<\/td>\n<td class=\"org-left\">Median value of transferable goods and assets or family income in tens of $<\/td>\n<\/tr>\n\n<tr>\n<td class=\"org-left\">BelowMed<\/td>\n<td class=\"org-left\">The number of families per 1000 earning below 1\/2 the median income<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">crime.head()\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n   crimerat  maleteen  south  educ  police60  police59  labor   males    pop  \\\n0      79.1     151.0    1.0   9.1      58.0      56.0  510.0   950.0   33.0   \n1     163.5     143.0    0.0  11.3     103.0      95.0  583.0  1012.0   13.0   \n2      57.8     142.0    1.0   8.9      45.0      44.0  533.0   969.0   18.0   \n3     196.9     136.0    0.0  12.1     149.0     141.0  577.0   994.0  157.0   \n4     123.4     141.0    0.0  12.1     109.0     101.0  591.0   985.0   18.0   \n   nonwhite  unemp1  unemp2  median  belowmed  \n0     301.0   108.0    41.0   394.0     261.0  \n1     102.0    96.0    36.0   557.0     194.0  \n2     219.0    94.0    33.0   318.0     250.0  \n3      80.0   102.0    39.0   673.0     167.0  \n4      30.0    91.0    20.0   578.0     174.0  \n<\/pre>\n\n\n<p> <a id=\"org99adeeb\"><\/a> The <code>south<\/code> feature is actually categorical and cannot be treated in the same way as others but let&rsquo;s pretend it is not different for this exercise 
<\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">crime.describe()\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n         crimerat    maleteen      south      educ    police60    police59  \\\ncount   47.000000   47.000000  47.000000  47.00000   47.000000   47.000000   \nmean    90.508511  138.574468   0.340426  10.56383   85.000000   80.234043   \nstd     38.676270   12.567634   0.478975   1.11870   29.718974   27.961319   \nmin     34.200000  119.000000   0.000000   8.70000   45.000000   41.000000   \n25%     65.850000  130.000000   0.000000   9.75000   62.500000   58.500000   \n50%     83.100000  136.000000   0.000000  10.80000   78.000000   73.000000   \n75%    105.750000  146.000000   1.000000  11.45000  104.500000   97.000000   \nmax    199.300000  177.000000   1.000000  12.20000  166.000000  157.000000   \n            labor        males         pop    nonwhite      unemp1     unemp2  \\\ncount   47.000000    47.000000   47.000000   47.000000   47.000000  47.000000   \nmean   561.191489   983.021277   36.617021  101.127660   95.468085  33.978723   \nstd     40.411814    29.467365   38.071188  102.828819   18.028783   8.445450   \nmin    480.000000   934.000000    3.000000    2.000000   70.000000  20.000000   \n25%    530.500000   964.500000   10.000000   24.000000   80.500000  27.500000   \n50%    560.000000   977.000000   25.000000   76.000000   92.000000  34.000000   \n75%    593.000000   992.000000   41.500000  132.500000  104.000000  38.500000   \nmax    641.000000  1071.000000  168.000000  423.000000  142.000000  58.000000   \n           median    belowmed  \ncount   47.000000   47.000000  \nmean   525.382979  194.000000  \nstd     96.490944   39.896061  \nmin    288.000000  126.000000  \n25%    459.500000  165.500000  \n50%    537.000000  176.000000  \n75%    591.500000  227.500000  \nmax    689.000000  276.000000  \n<\/pre>\n\n\n<p> <a id=\"org756457a\"><\/a> Note 
that there are some very skewed distributions like the non white which has a very large standard deviation respect to the mean; this value also shows a long queue according to the percentiles. <\/p>\n\n<p> Moreover, due to their definitions some features have very different ranges. <\/p>\n\n<p> This may have an impact in evaluating the eigenvectors as some dimensions may appear as more relevant then others due to their scale. <\/p>\n\n<p> For these reasons we may expect that renormalizing all distributions respect to their standard deviation may change our findings. <\/p>\n\n<p> <a id=\"org7b836ed\"><\/a> <\/p>\n<\/div>\n<div id=\"outline-container-evaluating-correlations-and-covariance\" class=\"outline-4\">\n<h4 id=\"evaluating-correlations-and-covariance\">Evaluating correlations and covariance<\/h4>\n<div class=\"outline-text-4\" id=\"text-evaluating-correlations-and-covariance\">\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> numpy <span style=\"color: #cba6f7;\">as<\/span> np\n<span style=\"color: #cdd6f4;\">crime_array<\/span> <span style=\"color: #89dceb;\">=<\/span> np.transpose(np.array(crime))\n<span style=\"color: #cdd6f4;\">covariance<\/span> <span style=\"color: #89dceb;\">=<\/span> np.cov(crime_array)\n<span style=\"color: #cdd6f4;\">correlation<\/span> <span style=\"color: #89dceb;\">=<\/span> np.corrcoef(crime_array)\npd.DataFrame({<span style=\"color: #a6e3a1;\">\"correlation\"<\/span>:correlation[<span style=\"color: #fab387;\">0<\/span>,<span style=\"color: #fab387;\">1<\/span>:],<span style=\"color: #a6e3a1;\">\"features\"<\/span>:headers[<span style=\"color: #fab387;\">1<\/span>:]})\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n    correlation  features\n0     -0.089472  maleteen\n1     -0.090637     south\n2      0.322835      educ\n3      0.687604  police60\n4      0.666714  police59\n5      0.188866     labor\n6    
  0.213914     males\n7      0.337474       pop\n8      0.032599  nonwhite\n9     -0.050478    unemp1\n10     0.177321    unemp2\n11     0.441320    median\n12    -0.179024  belowmed\n<\/pre>\n\n\n<p> <a id=\"orgb198098\"><\/a> apparently the most relevant correlation with the crime rate is the increase in police expenditure, which is probably more a consequence than a cause <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">from<\/span> numpy.linalg._linalg <span style=\"color: #cba6f7;\">import<\/span> EigResult\n<span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">eigenvalues are assumed sorted from the most to the least relevant (note: np.linalg.eig does not guarantee this ordering)<\/span>\n<span style=\"color: #cdd6f4;\">result<\/span> :EigResult <span style=\"color: #89dceb;\">=<\/span> np.linalg.eig(covariance)\n\n<span style=\"color: #cba6f7;\">def<\/span> <span style=\"color: #89b4fa;\">relevant<\/span>(headers: [<span style=\"color: #f38ba8;\">str<\/span>], result: EigResult, rank: <span style=\"color: #f38ba8;\">int<\/span>):\n    <span style=\"color: #6c7086;\">\"\"\"returns the features of the rank-th eigenvalue sorted from the largest descending\"\"\"<\/span>\n    <span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">extract the rank-th eigenvector<\/span>\n    <span style=\"color: #cdd6f4;\">vector<\/span> <span style=\"color: #89dceb;\">=<\/span> result.eigenvectors[:,rank] \n    <span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">square it to get rid of sign<\/span>\n    <span style=\"color: #cdd6f4;\">vector_sq<\/span> <span style=\"color: #89dceb;\">=<\/span> vector <span style=\"color: #89dceb;\">*<\/span> vector\n    <span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">get the order from smallest to largest<\/span>\n    <span style=\"color: #cdd6f4;\">order<\/span> <span 
style=\"color: #89dceb;\">=<\/span> vector_sq.argsort()\n    <span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">reverse order and return the features from the most relevant<\/span>\n    <span style=\"color: #cba6f7;\">return<\/span> [headers[<span style=\"color: #f38ba8;\">int<\/span>(i)] <span style=\"color: #cba6f7;\">for<\/span> i <span style=\"color: #cba6f7;\">in<\/span> <span style=\"color: #f38ba8;\">reversed<\/span>(order)]\n<\/pre>\n<\/div>\n\n<p> <a id=\"org9ccbfb3\"><\/a> let&rsquo;s grab the 5 most relevant set of features <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">for<\/span> i <span style=\"color: #cba6f7;\">in<\/span> <span style=\"color: #f38ba8;\">range<\/span>(<span style=\"color: #fab387;\">5<\/span>):\n    <span style=\"color: #f38ba8;\">print<\/span>(relevant(headers, result, i))\n<\/pre>\n<\/div>\n\n<em><\/em>\n<pre class=\"example\" id=\"nil\">\n['nonwhite', 'median', 'belowmed', 'police60', 'police59', 'labor', 'crimerat', 'maleteen', 'males', 'pop', 'unemp1', 'educ', 'south', 'unemp2']\n['nonwhite', 'median', 'crimerat', 'pop', 'police60', 'police59', 'males', 'belowmed', 'labor', 'unemp1', 'unemp2', 'maleteen', 'south', 'educ']\n['labor', 'males', 'pop', 'crimerat', 'nonwhite', 'belowmed', 'maleteen', 'unemp2', 'unemp1', 'police59', 'educ', 'police60', 'median', 'south']\n['pop', 'crimerat', 'median', 'belowmed', 'nonwhite', 'labor', 'police60', 'police59', 'males', 'unemp1', 'unemp2', 'maleteen', 'educ', 'south']\n['labor', 'crimerat', 'pop', 'males', 'unemp1', 'unemp2', 'belowmed', 'police60', 'nonwhite', 'police59', 'median', 'maleteen', 'south', 'educ']\n<\/pre>\n\n<p> <a id=\"orgc119d4c\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-performing-regression-from-multiple-inputs\" class=\"outline-4\">\n<h4 id=\"performing-regression-from-multiple-inputs\">Performing regression 
from multiple inputs<\/h4>\n<div class=\"outline-text-4\" id=\"text-performing-regression-from-multiple-inputs\">\n<p> In the following multilinear regression we construct a formula representing the features which may impact the expected output <\/p>\n\n<em><\/em>\n<pre class=\"example\" id=\"nil\">\noutput ~ feature1 + feature2 + feature3\n<\/pre>\n\n<p> I chose to use all of the features which appear as most relevant in the first eigenvector and rank ahead of our output variable <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">formula<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #a6e3a1;\">\"crimerat ~ \"<\/span><span style=\"color: #89dceb;\">+<\/span> (<span style=\"color: #a6e3a1;\">\" + \"<\/span>.join(relevant(headers, result, <span style=\"color: #fab387;\">0<\/span>)[:<span style=\"color: #fab387;\">6<\/span>]))\n<span style=\"color: #f38ba8;\">print<\/span>(formula)\n<span style=\"color: #cdd6f4;\">model<\/span> <span style=\"color: #89dceb;\">=<\/span> smf.ols(formula,crime)\n<span style=\"color: #cdd6f4;\">regression<\/span> <span style=\"color: #89dceb;\">=<\/span> model.fit()\nregression.summary()\n<\/pre>\n<\/div>\n\n<em><\/em>\n<pre class=\"example\" id=\"nil\">\ncrimerat ~ nonwhite + median + belowmed + police60 + police59 + labor\n<\/pre>\n\n<pre class=\"example\">\n&lt;class 'statsmodels.iolib.summary.Summary'&gt;\n\"\"\"\n                            OLS Regression Results                            \n==============================================================================\nDep. Variable:               crimerat   R-squared:                       0.638\nModel:                            OLS   Adj. 
R-squared:                  0.584\nMethod:                 Least Squares   F-statistic:                     11.75\nDate:                Sun, 05 Jan 2025   Prob (F-statistic):           1.48e-07\nTime:                        21:48:11   Log-Likelihood:                -214.10\nNo. Observations:                  47   AIC:                             442.2\nDf Residuals:                      40   BIC:                             455.2\nDf Model:                           6                                         \nCovariance Type:            nonrobust                                         \n==============================================================================\n                 coef    std err          t      P&gt;|t|      [0.025      0.975]\n------------------------------------------------------------------------------\nIntercept   -304.9695     96.968     -3.145      0.003    -500.950    -108.989\nnonwhite       0.0050      0.056      0.088      0.930      -0.109       0.119\nmedian         0.1588      0.112      1.419      0.164      -0.067       0.385\nbelowmed       0.6875      0.223      3.085      0.004       0.237       1.138\npolice60       1.3928      1.140      1.222      0.229      -0.910       3.696\npolice59      -0.3685      1.239     -0.297      0.768      -2.872       2.135\nlabor          0.1592      0.100      1.594      0.119      -0.043       0.361\n==============================================================================\nOmnibus:                        2.339   Durbin-Watson:                   2.004\nProb(Omnibus):                  0.311   Jarque-Bera (JB):                1.581\nSkew:                          -0.436   Prob(JB):                        0.454\nKurtosis:                       3.220   Cond. No.                     
2.16e+04\n==============================================================================\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n[2] The condition number is large, 2.16e+04. This might indicate that there are\nstrong multicollinearity or other numerical problems.\n\"\"\"\n<\/pre>\n\n\n<p> <a id=\"org9e0c1cb\"><\/a> The result of the fit method shown above displays a wealth of information; most notably <\/p>\n\n<ul class=\"org-ul\">\n<li>some quality evaluation of the regression e.g. <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R^2\" class=\"latex\" \/><\/li>\n<li>all the evaluated parameters and the intercept<\/li>\n<\/ul>\n\n<p> <a id=\"org7d16e8a\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-exploring-features\" class=\"outline-4\">\n<h4 id=\"exploring-features\">Exploring features<\/h4>\n<div class=\"outline-text-4\" id=\"text-exploring-features\">\n<p> it is also important not to blindly accept the result of a regression without further analysis of the dataset <\/p>\n\n<p> <a id=\"org2573c2f\"><\/a> In the following code I will check how the output variable depends on the features we examined; as this plot does not really show the interdependence of all features, some images may be difficult to interpret <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fig<\/span>, <span style=\"color: #cdd6f4;\">axs<\/span> <span style=\"color: #89dceb;\">=<\/span> mpl.subplots(<span style=\"color: #fab387;\">1<\/span>,<span style=\"color: #fab387;\">6<\/span>,sharey<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #fab387;\">True<\/span>,figsize<span style=\"color: #89dceb;\">=<\/span>(<span style=\"color: #fab387;\">18<\/span>,<span style=\"color: 
3<">
#fab387;\">3<\/span>))\n<span style=\"color: #cdd6f4;\">features<\/span> <span style=\"color: #89dceb;\">=<\/span> relevant(headers, result, <span style=\"color: #fab387;\">0<\/span>)[:<span style=\"color: #fab387;\">6<\/span>]\n<span style=\"color: #cba6f7;\">for<\/span> i <span style=\"color: #cba6f7;\">in<\/span> <span style=\"color: #f38ba8;\">range<\/span>(<span style=\"color: #fab387;\">6<\/span>):\n    sns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>crime[features[i]],y<span style=\"color: #89dceb;\">=<\/span>crime.crimerat,ax<span style=\"color: #89dceb;\">=<\/span>axs[i])\n<\/pre>\n<\/div>\n\n<div id=\"org5db72f8\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/7bf15c2d5162ced64037199385e5dbd9f6e0502f.png?ssl=1\" alt=\"7bf15c2d5162ced64037199385e5dbd9f6e0502f.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org952e527\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-correcting-eigenvector-bias-with-correlation-matrix\" class=\"outline-4\">\n<h4 id=\"correcting-eigenvector-bias-with-correlation-matrix\">Correcting eigenvector bias with correlation matrix<\/h4>\n<div class=\"outline-text-4\" id=\"text-correcting-eigenvector-bias-with-correlation-matrix\">\n<p> by using the correlation matrix instead of the covariance matrix, all features are now normalized to the range between -1 and 1 <\/p>\n\n<p> As we can see, the most interesting eigenvectors change <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">result2<\/span> <span style=\"color: #89dceb;\">=<\/span> np.linalg.eig(correlation)\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">for<\/span> i <span style=\"color: #cba6f7;\">in<\/span> <span style=\"color: 
#f38ba8;\">range<\/span>(<span style=\"color: #fab387;\">5<\/span>):\n    <span style=\"color: #f38ba8;\">print<\/span>(relevant(headers, result2, i))\n<\/pre>\n<\/div>\n\n<em><\/em>\n<pre class=\"example\" id=\"nil\">\n['median', 'belowmed', 'educ', 'police59', 'police60', 'south', 'maleteen', 'nonwhite', 'crimerat', 'labor', 'males', 'pop', 'unemp1', 'unemp2']\n['pop', 'labor', 'unemp2', 'males', 'police60', 'police59', 'nonwhite', 'crimerat', 'south', 'educ', 'median', 'belowmed', 'unemp1', 'maleteen']\n['unemp1', 'unemp2', 'labor', 'maleteen', 'crimerat', 'males', 'nonwhite', 'police59', 'police60', 'pop', 'south', 'educ', 'belowmed', 'median']\n['males', 'crimerat', 'maleteen', 'labor', 'nonwhite', 'belowmed', 'unemp1', 'pop', 'south', 'unemp2', 'police60', 'police59', 'educ', 'median']\n['pop', 'labor', 'belowmed', 'south', 'maleteen', 'police59', 'median', 'police60', 'unemp2', 'educ', 'unemp1', 'nonwhite', 'crimerat', 'males']\n<\/pre>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">rank_no<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #fab387;\">0<\/span>\n<span style=\"color: #cdd6f4;\">features_count<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #fab387;\">8<\/span>\n<span style=\"color: #cdd6f4;\">formula<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #a6e3a1;\">\"crimerat ~ \"<\/span><span style=\"color: #89dceb;\">+<\/span> (<span style=\"color: #a6e3a1;\">\" + \"<\/span>.join(relevant(headers, result2, rank_no)[:features_count]))\n<span style=\"color: #f38ba8;\">print<\/span>(formula)\n<span style=\"color: #cdd6f4;\">model<\/span> <span style=\"color: #89dceb;\">=<\/span> smf.ols(formula,crime)\n<span style=\"color: #cdd6f4;\">regression<\/span> <span style=\"color: #89dceb;\">=<\/span> model.fit()\nregression.summary()\n<\/pre>\n<\/div>\n\n<em><\/em>\n<pre 
class=\"example\" id=\"nil\">\ncrimerat ~ median + belowmed + educ + police59 + police60 + south + maleteen + nonwhite\n<\/pre>\n\n<pre class=\"example\">\n&lt;class 'statsmodels.iolib.summary.Summary'&gt;\n\"\"\"\n                            OLS Regression Results                            \n==============================================================================\nDep. Variable:               crimerat   R-squared:                       0.730\nModel:                            OLS   Adj. R-squared:                  0.673\nMethod:                 Least Squares   F-statistic:                     12.82\nDate:                Sat, 04 Jan 2025   Prob (F-statistic):           1.02e-08\nTime:                        21:27:44   Log-Likelihood:                -207.24\nNo. Observations:                  47   AIC:                             432.5\nDf Residuals:                      38   BIC:                             449.1\nDf Model:                           8                                         \nCovariance Type:            nonrobust                                         \n==============================================================================\n                 coef    std err          t      P&gt;|t|      [0.025      0.975]\n------------------------------------------------------------------------------\nIntercept   -537.5940    108.276     -4.965      0.000    -756.786    -318.402\nmedian         0.1764      0.101      1.740      0.090      -0.029       0.382\nbelowmed       0.8438      0.211      3.994      0.000       0.416       1.271\neduc          14.4615      5.068      2.853      0.007       4.201      24.722\npolice59      -0.8715      1.099     -0.793      0.433      -3.096       1.353\npolice60       1.8952      1.015      1.868      0.069      -0.159       3.949\nsouth         -1.9020     12.426     -0.153      0.879     -27.057      23.253\nmaleteen       0.9286      0.379      2.451      0.019       0.161       1.696\nnonwhite      -0.0025 
     0.060     -0.041      0.967      -0.124       0.119\n==============================================================================\nOmnibus:                        0.285   Durbin-Watson:                   1.792\nProb(Omnibus):                  0.867   Jarque-Bera (JB):                0.010\nSkew:                          -0.016   Prob(JB):                        0.995\nKurtosis:                       3.064   Cond. No.                     2.02e+04\n==============================================================================\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n[2] The condition number is large, 2.02e+04. This might indicate that there are\nstrong multicollinearity or other numerical problems.\n\"\"\"\n<\/pre>\n\n\n<p> <a id=\"org468ae64\"><\/a> Interestingly this regression shows a better <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R^2\" class=\"latex\" \/> than the previous one, suggesting the effectiveness of using normalized distributions <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">rank_no<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #fab387;\">0<\/span>\n<span style=\"color: #cdd6f4;\">features_count<\/span> <span style=\"color: #89dceb;\">=<\/span> <span style=\"color: #fab387;\">8<\/span>\n<span style=\"color: #cdd6f4;\">fig<\/span>, <span style=\"color: #cdd6f4;\">axs<\/span> <span style=\"color: #89dceb;\">=<\/span> mpl.subplots(<span style=\"color: #fab387;\">1<\/span>,features_count,sharey<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #fab387;\">True<\/span>,figsize<span style=\"color: #89dceb;\">=<\/span>(features_count <span style=\"color: #89dceb;\">*<\/span> <span style=\"color: #fab387;\">3<\/span>,<span style=\"color: 
3<">
#fab387;\">3<\/span>))\n<span style=\"color: #cdd6f4;\">features<\/span> <span style=\"color: #89dceb;\">=<\/span> relevant(headers, result2, <span style=\"color: #fab387;\">0<\/span>)[:features_count]\n<span style=\"color: #cba6f7;\">for<\/span> i <span style=\"color: #cba6f7;\">in<\/span> <span style=\"color: #f38ba8;\">range<\/span>(features_count):\n    sns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>crime[features[i]],y<span style=\"color: #89dceb;\">=<\/span>crime.crimerat,ax<span style=\"color: #89dceb;\">=<\/span>axs[i])\n<\/pre>\n<\/div>\n\n<div id=\"org2540d3b\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/6f91bad8d5ed8c17c2d3d73df1bb6af39a6066b6.png?ssl=1\" alt=\"6f91bad8d5ed8c17c2d3d73df1bb6af39a6066b6.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org2ef46f8\"><\/a> <\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-more-visualization-of-the-correlations\" class=\"outline-4\">\n<h4 id=\"more-visualization-of-the-correlations\">More visualization of the correlations<\/h4>\n<div class=\"outline-text-4\" id=\"text-more-visualization-of-the-correlations\">\n<p> in the following examples I will show a couple of scatter plots of the most relevant features, using color for the output variable; while this visualization does not add great insight, it can nonetheless raise interesting questions about the mutual connections between the features <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">sns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>crime.belowmed,y<span style=\"color: #89dceb;\">=<\/span>crime[<span style=\"color: #a6e3a1;\">\"median\"<\/span>],hue<span style=\"color: #89dceb;\">=<\/span>crime.crimerat)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='belowmed', ylabel='median'&gt;\n<\/pre>\n\n\n<div id=\"orgc74105b\" class=\"figure\"> 
<p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/2703fe16dfa3b194b2581af121e89a578dfbbc5f.png?ssl=1\" alt=\"2703fe16dfa3b194b2581af121e89a578dfbbc5f.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"orgdcd86c9\"><\/a> This image shows that some of the highest crime rates appear in areas where economic indicators seem more favorable, which demonstrates how complex and controversial this analysis may be: deciding which features to include may have important consequences. <\/p>\n\n<p> A 3D version of the same plot, adding the education feature <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #6c7086;\">#<\/span><span style=\"color: #6c7086;\">sns.scatterplot(x=crime.belowmed,y=crime[\"median\"],hue=crime.crimerat)<\/span>\n<span style=\"color: #cba6f7;\">from<\/span> mpl_toolkits.mplot3d <span style=\"color: #cba6f7;\">import<\/span> Axes3D\nsns.set_style(<span style=\"color: #a6e3a1;\">\"whitegrid\"<\/span>, {<span style=\"color: #a6e3a1;\">'axes.grid'<\/span> : <span style=\"color: #fab387;\">False<\/span>})\n\n<span style=\"color: #cdd6f4;\">fig<\/span> <span style=\"color: #89dceb;\">=<\/span> plt.figure()\n\n<span style=\"color: #cdd6f4;\">ax<\/span> <span style=\"color: #89dceb;\">=<\/span> Axes3D(fig) \nfig.add_axes(ax)\n<span style=\"color: #cdd6f4;\">x<\/span><span style=\"color: #89dceb;\">=<\/span>crime.belowmed\n<span style=\"color: #cdd6f4;\">y<\/span><span style=\"color: #89dceb;\">=<\/span>crime[<span style=\"color: #a6e3a1;\">\"median\"<\/span>]\n<span style=\"color: #cdd6f4;\">z<\/span><span style=\"color: #89dceb;\">=<\/span>crime.educ\n\nax.scatter(x, y, z, c<span style=\"color: #89dceb;\">=<\/span>crime.crimerat, marker<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">'o'<\/span>)\nax.set_xlabel(<span style=\"color: 
'belowmed'">
#a6e3a1;\">'belowmed'<\/span>)\nax.set_ylabel(<span style=\"color: #a6e3a1;\">'median'<\/span>)\nax.set_zlabel(<span style=\"color: #a6e3a1;\">'educ'<\/span>)\nax\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes3D: xlabel='belowmed', ylabel='median', zlabel='educ'&gt;\n<\/pre>\n\n\n<div id=\"org8f3f886\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/3854155072a7f2c4eada3d14780607169b44cf4b.png?ssl=1\" alt=\"3854155072a7f2c4eada3d14780607169b44cf4b.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org5e4f4f4\"><\/a> <\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-non-linear-features\" class=\"outline-3\">\n<h3 id=\"non-linear-features\">Non-Linear features<\/h3>\n<div class=\"outline-text-3\" id=\"text-non-linear-features\">\n<p> the linearity of linear models refers to how the different features are combined, not to the features themselves, so they may also be used in non-linear cases, e.g. trying to fit a polynomial model <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #6c7086;\"># <\/span><span style=\"color: #6c7086;\">this library is used to read excel files<\/span>\n!pip install openpyxl\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> pandas <span style=\"color: #cba6f7;\">as<\/span> pd\n<\/pre>\n<\/div>\n\n<p> <a id=\"org753f608\"><\/a> The following dataset describes financial performance metrics across many countries <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">financial<\/span> <span style=\"color: #89dceb;\">=<\/span> pd.read_excel(<span style=\"color: 
#a6e3a1;\">\"20220909-global-financial-development-database.xlsx\"<\/span>,sheet_name<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"Data - August 2022\"<\/span>)\n<\/pre>\n<\/div>\n\n<p> <a id=\"orgd9e50a1\"><\/a> let&rsquo;s first set some attributes as categorical: we may use them eventually as a filter <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">for<\/span> col <span style=\"color: #cba6f7;\">in<\/span> [<span style=\"color: #a6e3a1;\">\"iso3\"<\/span>, <span style=\"color: #a6e3a1;\">\"iso2\"<\/span>, <span style=\"color: #a6e3a1;\">\"imfn\"<\/span>, <span style=\"color: #a6e3a1;\">\"country\"<\/span>, <span style=\"color: #a6e3a1;\">\"region\"<\/span>, <span style=\"color: #a6e3a1;\">\"income\"<\/span>]:\n    <span style=\"color: #cdd6f4;\">financial<\/span>[col] <span style=\"color: #89dceb;\">=<\/span> financial[col].astype(<span style=\"color: #a6e3a1;\">\"category\"<\/span>)\n<\/pre>\n<\/div>\n\n<p> <a id=\"org9ed8770\"><\/a> In this example we will focus on a particular financial metric <code>di01<\/code> <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">financial[[<span style=\"color: #a6e3a1;\">\"country\"<\/span>,<span style=\"color: #a6e3a1;\">\"region\"<\/span>,<span style=\"color: #a6e3a1;\">\"di01\"<\/span>]].describe(include<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"all\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n            country                 region         di01\ncount         13268                  13268  8594.000000\nunique          214                      7          NaN\ntop     Afghanistan  Europe &amp; Central Asia          NaN\nfreq             62                   3596          NaN\nmean            NaN                    NaN    37.321250\nstd             
NaN                    NaN    34.811684\nmin             NaN                    NaN     0.010371\n25%             NaN                    NaN    13.054380\n50%             NaN                    NaN    26.018790\n75%             NaN                    NaN    50.293530\nmax             NaN                    NaN   304.574500\n<\/pre>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> seaborn <span style=\"color: #cba6f7;\">as<\/span> sns\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">sns.scatterplot(financial,x<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"di01\"<\/span>,hue<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"region\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='year', ylabel='di01'&gt;\n<\/pre>\n\n\n<div id=\"orgc11764d\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/2122593c0bf7ff090cb10fdfb1ac3e8efe2318db.png?ssl=1\" alt=\"2122593c0bf7ff090cb10fdfb1ac3e8efe2318db.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"org2d125f3\"><\/a> Let&rsquo;s first narrow it to a single country and show its dependence on time <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fin_italy<\/span> <span style=\"color: #89dceb;\">=<\/span> financial.loc[financial[<span style=\"color: #a6e3a1;\">\"country\"<\/span>]<span style=\"color: #89dceb;\">==<\/span><span style=\"color: #a6e3a1;\">\"Italy\"<\/span>,:]\nsns.scatterplot(fin_italy,x<span style=\"color: 
=<">
#89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"di01\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='year', ylabel='di01'&gt;\n<\/pre>\n\n\n<div id=\"orgb9f5db3\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/c12d07712f919017f1734505751d4b3a2a05d72c.png?ssl=1\" alt=\"c12d07712f919017f1734505751d4b3a2a05d72c.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"orgcf05c20\"><\/a> This shows some kind of growing trend: let&rsquo;s first try a simple linear regression with respect to the year <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">sns.regression.regplot(fin_italy,x<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"di01\"<\/span>)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='year', ylabel='di01'&gt;\n<\/pre>\n\n\n<div id=\"org6e19a8d\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/0ee7f5e0849cfb865c4f74a180c09d55409e0749.png?ssl=1\" alt=\"0ee7f5e0849cfb865c4f74a180c09d55409e0749.png\" \/> <\/p> <\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">from<\/span> scipy.stats <span style=\"color: #cba6f7;\">import<\/span> linregress\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">fin_italy[[<span style=\"color: #a6e3a1;\">\"year\"<\/span>,<span style=\"color: 
#a6e3a1;\">\"di01\"<\/span>]].describe()\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n              year       di01\ncount    62.000000  57.000000\nmean   1990.500000  65.643938\nstd      18.041619  13.908666\nmin    1960.000000  46.931830\n25%    1975.250000  54.484480\n50%    1990.500000  62.710020\n75%    2005.750000  75.462590\nmax    2021.000000  93.921490\n<\/pre>\n\n\n<p> <a id=\"orgbe47b74\"><\/a> We see this dataset does not contain metrics for all years so let&rsquo;s remove rows without values <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fin_italy_valued<\/span> <span style=\"color: #89dceb;\">=<\/span> fin_italy.loc[<span style=\"color: #89dceb;\">~<\/span>fin_italy.di01.isna(),[<span style=\"color: #a6e3a1;\">\"year\"<\/span>,<span style=\"color: #a6e3a1;\">\"di01\"<\/span>]]\n<\/pre>\n<\/div>\n\n<p> <a id=\"org93d07ba\"><\/a> Here we see the results of the regression <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">result<\/span> <span style=\"color: #89dceb;\">=<\/span> linregress(y<span style=\"color: #89dceb;\">=<\/span>fin_italy_valued.di01,x<span style=\"color: #89dceb;\">=<\/span>fin_italy_valued.year)\nresult\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\nLinregressResult(slope=np.float64(0.4768662371619365), intercept=np.float64(-884.1481148904821), rvalue=np.float64(0.5972453694029883), pvalue=np.float64(9.37869396904616e-07), stderr=np.float64(0.08635123015454191), intercept_stderr=np.float64(171.99538887670158))\n<\/pre>\n\n\n<p> <a id=\"orgd9a7842\"><\/a> The <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R^2\" class=\"latex\" \/> looks poor: <\/p>\n\n<div class=\"org-src-container\">\n<label 
class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">rsquare<\/span> <span style=\"color: #89dceb;\">=<\/span> result.rvalue <span style=\"color: #89dceb;\">**<\/span> <span style=\"color: #fab387;\">2<\/span>\nrsquare\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\nnp.float64(0.356702031273312)\n<\/pre>\n\n\n<p> <a id=\"org43d2b54\"><\/a> let&rsquo;s plot the residuals to see any clear behavior <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">residuals<\/span> <span style=\"color: #89dceb;\">=<\/span> fin_italy_valued.di01 <span style=\"color: #89dceb;\">-<\/span> (fin_italy_valued.year <span style=\"color: #89dceb;\">*<\/span> result.slope <span style=\"color: #89dceb;\">+<\/span> result.intercept)\nsns.scatterplot(x<span style=\"color: #89dceb;\">=<\/span>fin_italy_valued.year, y<span style=\"color: #89dceb;\">=<\/span>residuals)\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;Axes: xlabel='year', ylabel='None'&gt;\n<\/pre>\n\n\n<div id=\"orga0e1d34\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/548ce9a11737896121e37222ccda2425e1cad98c.png?ssl=1\" alt=\"548ce9a11737896121e37222ccda2425e1cad98c.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"orgc0bc260\"><\/a> <\/p>\n<\/div>\n<div id=\"outline-container-adding-nonlinear-features\" class=\"outline-4\">\n<h4 id=\"adding-nonlinear-features\">Adding nonlinear features<\/h4>\n<div class=\"outline-text-4\" id=\"text-adding-nonlinear-features\">\n<p> For simplicity of the fit we will use a column with years calculated as a difference from the first one. 
<\/p>\n\n<p> In this case the residuals suggest a kind of oscillatory behavior, but modelling it is beyond the scope of this tutorial, as it would involve estimating the periods and phase shifts of the oscillations. <\/p>\n\n<p> A simpler way to improve the fit is to use a higher-degree polynomial. <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cba6f7;\">import<\/span> statsmodels.formula.api <span style=\"color: #cba6f7;\">as<\/span> smf\n<\/pre>\n<\/div>\n\n<p> Let&rsquo;s create the nonlinear feature columns for a polynomial of degree 3. The higher the degree, the lower the error; however, an excessively large degree can lead to overfitting without adding much insight. <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fin_italy_valued<\/span>[<span style=\"color: #a6e3a1;\">\"dy\"<\/span>] <span style=\"color: #89dceb;\">=<\/span> fin_italy_valued.year <span style=\"color: #89dceb;\">-<\/span> <span style=\"color: #f38ba8;\">min<\/span>(fin_italy_valued.year)\n<span style=\"color: #cdd6f4;\">fin_italy_valued<\/span>[<span style=\"color: #a6e3a1;\">\"dy2\"<\/span>] <span style=\"color: #89dceb;\">=<\/span> fin_italy_valued.dy <span style=\"color: #89dceb;\">**<\/span> <span style=\"color: #fab387;\">2<\/span>\n<span style=\"color: #cdd6f4;\">fin_italy_valued<\/span>[<span style=\"color: #a6e3a1;\">\"dy3\"<\/span>] <span style=\"color: #89dceb;\">=<\/span> fin_italy_valued.dy <span style=\"color: #89dceb;\">**<\/span> <span style=\"color: 
#fab387;\">3<\/span>\n<\/pre>\n<\/div>\n\n<p> now we can fit and get the coefficients for these features <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">model<\/span> <span style=\"color: #89dceb;\">=<\/span> smf.ols(<span style=\"color: #a6e3a1;\">\"di01 ~ dy + dy2 + dy3\"<\/span>,fin_italy_valued)\n<span style=\"color: #cdd6f4;\">regression<\/span> <span style=\"color: #89dceb;\">=<\/span> model.fit()\nregression.summary()\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\n&lt;class 'statsmodels.iolib.summary.Summary'&gt;\n\"\"\"\n                            OLS Regression Results                            \n==============================================================================\nDep. Variable:                   di01   R-squared:                       0.598\nModel:                            OLS   Adj. R-squared:                  0.575\nMethod:                 Least Squares   F-statistic:                     26.30\nDate:                Sun, 19 Jan 2025   Prob (F-statistic):           1.48e-10\nTime:                        19:48:07   Log-Likelihood:                -204.44\nNo. 
Observations:                  57   AIC:                             416.9\nDf Residuals:                      53   BIC:                             425.1\nDf Model:                           3                                         \nCovariance Type:            nonrobust                                         \n==============================================================================\n                 coef    std err          t      P&gt;|t|      [0.025      0.975]\n------------------------------------------------------------------------------\nIntercept     69.6046      4.436     15.690      0.000      60.707      78.503\ndy            -1.8193      0.670     -2.715      0.009      -3.163      -0.475\ndy2            0.0614      0.027      2.262      0.028       0.007       0.116\ndy3           -0.0004      0.000     -1.350      0.183      -0.001       0.000\n==============================================================================\nOmnibus:                        6.613   Durbin-Watson:                   0.134\nProb(Omnibus):                  0.037   Jarque-Bera (JB):                3.703\nSkew:                           0.419   Prob(JB):                        0.157\nKurtosis:                       2.075   Cond. No.                     2.84e+05\n==============================================================================\nNotes:\n[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n[2] The condition number is large, 2.84e+05. 
This might indicate that there are\nstrong multicollinearity or other numerical problems.\n\"\"\"\n<\/pre>\n\n\n<p> The <img decoding=\"async\" src=\"https:\/\/s0.wp.com\/latex.php?latex=R%5E2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"R^2\" class=\"latex\" \/> value improved from 0.36 to 0.60 <\/p>\n\n<p> Here are our coefficients: the third-degree term contributes very little <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\">regression.params\n<\/pre>\n<\/div>\n\n<pre class=\"example\">\nIntercept    69.604622\ndy           -1.819269\ndy2           0.061431\ndy3          -0.000417\ndtype: float64\n<\/pre>\n\n\n<p> With them we can now plot the polynomial and verify the new fit; indexing the coefficients by label avoids relying on their position <\/p>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fin_italy_valued<\/span>[<span style=\"color: #a6e3a1;\">\"predicted\"<\/span>] <span style=\"color: #89dceb;\">=<\/span> regression.params[<span style=\"color: #a6e3a1;\">\"Intercept\"<\/span>] <span style=\"color: #89dceb;\">+<\/span> \\\n    regression.params[<span style=\"color: #a6e3a1;\">\"dy\"<\/span>] <span style=\"color: #89dceb;\">*<\/span> fin_italy_valued.dy <span style=\"color: #89dceb;\">+<\/span> \\\n    regression.params[<span style=\"color: #a6e3a1;\">\"dy2\"<\/span>] <span style=\"color: #89dceb;\">*<\/span> fin_italy_valued.dy2 <span style=\"color: #89dceb;\">+<\/span> \\\n    regression.params[<span style=\"color: #a6e3a1;\">\"dy3\"<\/span>] <span style=\"color: #89dceb;\">*<\/span> fin_italy_valued.dy3\n<\/pre>\n<\/div>\n\n<div class=\"org-src-container\">\n<label class=\"org-src-name\"><em><\/em><\/label>\n<pre class=\"src src-python\" id=\"nil\"><span style=\"color: #cdd6f4;\">fig<\/span>, <span style=\"color: #cdd6f4;\">ax<\/span> <span style=\"color: #89dceb;\">=<\/span> plt.subplots(<span style=\"color: #fab387;\">1<\/span>, 
<span style=\"color: #fab387;\">1<\/span>)\nsns.scatterplot(data<span style=\"color: #89dceb;\">=<\/span>fin_italy_valued,x<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"di01\"<\/span>, ax<span style=\"color: #89dceb;\">=<\/span>ax)\nsns.lineplot(data<span style=\"color: #89dceb;\">=<\/span>fin_italy_valued,x<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"year\"<\/span>,y<span style=\"color: #89dceb;\">=<\/span><span style=\"color: #a6e3a1;\">\"predicted\"<\/span>, ax<span style=\"color: #89dceb;\">=<\/span>ax)\n<\/pre>\n<\/div>\n\n<div id=\"orge9784b9\" class=\"figure\"> <p><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/623a675dc051431286d08cc57c923acaf8fce7d0.png?ssl=1\" alt=\"623a675dc051431286d08cc57c923acaf8fce7d0.png\" \/> <\/p> <\/div>\n\n<p> <a id=\"orgaf6228a\"><\/a> <\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"We made quite a journey so far! 
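As a closing note, instead of recombining `regression.params` by hand as above, statsmodels can evaluate the fitted formula directly with `predict()`. Here is a minimal sketch on hypothetical data (a synthetic cubic trend standing in for the real `fin_italy_valued` frame):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for fin_italy_valued: a noisy cubic trend.
rng = np.random.default_rng(0)
df = pd.DataFrame({"year": np.arange(1965, 2022)})
df["dy"] = df.year - df.year.min()
df["dy2"] = df.dy ** 2
df["dy3"] = df.dy ** 3
df["di01"] = (70.0 - 1.8 * df.dy + 0.06 * df.dy2 - 4e-4 * df.dy3
              + rng.normal(0.0, 2.0, size=len(df)))

# Same formula as in the post: a degree-3 polynomial in dy.
regression = smf.ols("di01 ~ dy + dy2 + dy3", df).fit()

# predict() applies the fitted formula to any frame with the same columns,
# so the polynomial does not have to be rebuilt from the coefficients.
df["predicted"] = regression.predict(df)
```

Using `predict()` also pays off when extrapolating: a new DataFrame with future `dy`, `dy2`, `dy3` columns can be passed in directly, with no risk of mismatching coefficients and features.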
Starting from Jupyter and Pandas we explored our datasets and created independent scripts.\n\nIt is now the time to learn the basics of a very powerful tool: Linear Regression.\n","protected":false},"author":1,"featured_media":663,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","inline_featured_image":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[4],"tags":[7],"class_list":["post-572","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-language-learning","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Hold the Line - Noise On The Net<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hold the Line - Noise On The Net\" \/>\n<meta property=\"og:description\" content=\"We made quite a journey so far! Starting from Jupyter and Pandas we explored our datasets and created independent scripts. 
It is now the time to learn the basics of a very powerful tool: Linear Regression.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/\" \/>\n<meta property=\"og:site_name\" content=\"Noise On The Net\" \/>\n<meta property=\"article:published_time\" content=\"2025-02-16T15:48:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-02-16T15:48:40+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"801\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"marco.p.v.vezzoli\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"marco.p.v.vezzoli\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/\"},\"author\":{\"name\":\"marco.p.v.vezzoli\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#\\\/schema\\\/person\\\/88c3a70f2b23480197bc61d6e1e2e982\"},\"headline\":\"Hold the Line\",\"datePublished\":\"2025-02-16T15:48:38+00:00\",\"dateModified\":\"2025-02-16T15:48:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/\"},\"wordCount\":1683,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#\\\/schema\\\/person\\\/88c3a70f2b23480197bc61d6e1e2e982\"},\"image\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/noiseonthenet.space\\\/noise\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1\",\"keywords\":[\"Python\"],\"articleSection\":[\"Language learning\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/\",\"url\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/\",\"name\":\"Hold the Line - Noise On The 
Net\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/noiseonthenet.space\\\/noise\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1\",\"datePublished\":\"2025-02-16T15:48:38+00:00\",\"dateModified\":\"2025-02-16T15:48:40+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#primaryimage\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/noiseonthenet.space\\\/noise\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/noiseonthenet.space\\\/noise\\\/wp-content\\\/uploads\\\/2025\\\/02\\\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1\",\"width\":1200,\"height\":801},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/2025\\\/02\\\/hold-the-line\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Hold the Line\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#website\",\"url\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/\",\"name\":\"Noise On The Net\",\"description\":\"Sharing adventures in 
code\",\"publisher\":{\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#\\\/schema\\\/person\\\/88c3a70f2b23480197bc61d6e1e2e982\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/#\\\/schema\\\/person\\\/88c3a70f2b23480197bc61d6e1e2e982\",\"name\":\"marco.p.v.vezzoli\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g\",\"caption\":\"marco.p.v.vezzoli\"},\"logo\":{\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g\"},\"description\":\"Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions\",\"sameAs\":[\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/in\\\/marco-paolo-valerio-vezzoli-0663835\\\/\"],\"url\":\"https:\\\/\\\/noiseonthenet.space\\\/noise\\\/author\\\/marco-p-v-vezzoli\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Hold the Line - Noise On The Net","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/","og_locale":"en_US","og_type":"article","og_title":"Hold the Line - Noise On The Net","og_description":"We made quite a journey so far! Starting from Jupyter and Pandas we explored our datasets and created independent scripts. It is now the time to learn the basics of a very powerful tool: Linear Regression.","og_url":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/","og_site_name":"Noise On The Net","article_published_time":"2025-02-16T15:48:38+00:00","article_modified_time":"2025-02-16T15:48:40+00:00","og_image":[{"width":1200,"height":801,"url":"https:\/\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg","type":"image\/jpeg"}],"author":"marco.p.v.vezzoli","twitter_card":"summary_large_image","twitter_misc":{"Written by":"marco.p.v.vezzoli","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#article","isPartOf":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/"},"author":{"name":"marco.p.v.vezzoli","@id":"https:\/\/noiseonthenet.space\/noise\/#\/schema\/person\/88c3a70f2b23480197bc61d6e1e2e982"},"headline":"Hold the Line","datePublished":"2025-02-16T15:48:38+00:00","dateModified":"2025-02-16T15:48:40+00:00","mainEntityOfPage":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/"},"wordCount":1683,"commentCount":0,"publisher":{"@id":"https:\/\/noiseonthenet.space\/noise\/#\/schema\/person\/88c3a70f2b23480197bc61d6e1e2e982"},"image":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1","keywords":["Python"],"articleSection":["Language learning"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/","url":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/","name":"Hold the Line - Noise On The 
Net","isPartOf":{"@id":"https:\/\/noiseonthenet.space\/noise\/#website"},"primaryImageOfPage":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#primaryimage"},"image":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1","datePublished":"2025-02-16T15:48:38+00:00","dateModified":"2025-02-16T15:48:40+00:00","breadcrumb":{"@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#primaryimage","url":"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1","contentUrl":"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1","width":1200,"height":801},{"@type":"BreadcrumbList","@id":"https:\/\/noiseonthenet.space\/noise\/2025\/02\/hold-the-line\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/noiseonthenet.space\/noise\/"},{"@type":"ListItem","position":2,"name":"Hold the Line"}]},{"@type":"WebSite","@id":"https:\/\/noiseonthenet.space\/noise\/#website","url":"https:\/\/noiseonthenet.space\/noise\/","name":"Noise On The Net","description":"Sharing adventures in 
code","publisher":{"@id":"https:\/\/noiseonthenet.space\/noise\/#\/schema\/person\/88c3a70f2b23480197bc61d6e1e2e982"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/noiseonthenet.space\/noise\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/noiseonthenet.space\/noise\/#\/schema\/person\/88c3a70f2b23480197bc61d6e1e2e982","name":"marco.p.v.vezzoli","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g","caption":"marco.p.v.vezzoli"},"logo":{"@id":"https:\/\/secure.gravatar.com\/avatar\/b9d9aab1df560bc14d73b0b442198f196ce39e7c7a38df1dc22fec0b97f17da9?s=96&d=mm&r=g"},"description":"Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. 
Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions","sameAs":["https:\/\/noiseonthenet.space\/noise\/","https:\/\/www.linkedin.com\/in\/marco-paolo-valerio-vezzoli-0663835\/"],"url":"https:\/\/noiseonthenet.space\/noise\/author\/marco-p-v-vezzoli\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/noiseonthenet.space\/noise\/wp-content\/uploads\/2025\/02\/sam-goodgame-Pe5BC-EDtB4-unsplash.jpg?fit=1200%2C801&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pdDUZ5-9e","jetpack-related-posts":[],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/posts\/572","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/comments?post=572"}],"version-history":[{"count":5,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/posts\/572\/revisions"}],"predecessor-version":[{"id":686,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/posts\/572\/revisions\/686"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/media\/663"}],"wp:attachment":[{"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/media?parent=572"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/categories?post=572"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/noiseonthenet.space\/noise\/wp-json\/wp\/v2\/tags?post=572"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}