Oxidizing Lagrange Polynomials for Machine Learning


Photo by Jay Heike on Unsplash

In recent years, machine learning techniques have popularized well-known algorithms such as linear regression, now applied in a larger-data context. Polynomial regression is one such example.

In the past, researchers defined formulae upfront and then applied regression to fit the model parameters. The machine learning approach here consists of using the data to evaluate which polynomial degree best fits the observations without overfitting.

The code for this post is available here and builds on a previous post, Polynomials ring.

If you see any error or would like to suggest a more efficient implementation, please let me know in the comments.

What are Lagrange Polynomials

Lagrange polynomials are used to find an interpolating univariate polynomial (i.e. a polynomial in a single variable).

Given a list x of n distinct points in R (we will call them nodes from now on) and a list y of real values associated with them, it is possible to find exactly one polynomial of degree at most n - 1 passing through them.

The problem is equivalent to the following linear system:

\left[ \begin{matrix} 1 & x_1 & x^2_1 & \dots & x^{n-1}_1 \\ 1 & x_2 & x^2_2 & \dots & x^{n-1}_2 \\ \vdots \\ 1 & x_n & x^2_n & \dots & x^{n-1}_n \\ \end{matrix} \right] \times \left[ \begin{matrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \\ \end{matrix} \right] = \left[ \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{matrix} \right]

The Lagrange polynomial basis (denoted here by l_i[x]) is a way to transform the problem into an equivalent linear problem whose matrix is diagonal (the identity), i.e.

I \times \left[ \begin{matrix} b_1 \\ b_2 \\ \vdots \\ b_n \\ \end{matrix} \right] = \left[ \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{matrix} \right]

so the solution becomes

P[x] = \sum_{i}b_i l_i[x] = \sum_{i}y_i l_i[x]

Each basis polynomial is designed to have roots at all nodes but one; at that remaining node it equals 1.

Here is a simple example: suppose we have only two nodes, at 0 and 1. The Lagrange polynomial basis will be composed of the following:

P_0[x] = 1 - x

P_1[x] = x

degree1.png

If we have, say, y(0) = 1 and y(1) = 3, the interpolating polynomial becomes

P[x] = 1 * P_0[x] + 3 * P_1[x] = (1 - x) + 3x = 2x + 1

In the same way, if the nodes are placed at -1, 0 and 1 we have:

P_{-1}[x] = \frac{1}{2} x (x-1)

P_0[x] = - (x+1)(x-1)

P_1[x] = \frac{1}{2} (x+1) x

degree2.png
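These basis values are easy to check numerically. Here is a small standalone sketch (the `lagrange_basis` helper is an illustration of mine, not part of the post's codebase):

```rust
/// Evaluate the i-th Lagrange basis polynomial for the given nodes at x.
fn lagrange_basis(nodes: &[f64], i: usize, x: f64) -> f64 {
    nodes
        .iter()
        .enumerate()
        .filter(|(j, _)| *j != i)
        .map(|(_, xj)| (x - xj) / (nodes[i] - xj))
        .product()
}

fn main() {
    let nodes = [-1.0, 0.0, 1.0];
    // each basis element is 1 at its own node and 0 at every other node
    for i in 0..nodes.len() {
        for (j, &xj) in nodes.iter().enumerate() {
            let expected = if i == j { 1.0 } else { 0.0 };
            assert!((lagrange_basis(&nodes, i, xj) - expected).abs() < 1e-12);
        }
    }
    println!("basis checks passed");
}
```

For the three nodes above, `lagrange_basis(&nodes, 1, x)` reproduces P_0[x] = -(x+1)(x-1).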

Why should Lagrange Polynomials matter in Machine Learning

Polynomial regression is a typical example of how machine learning built upon existing knowledge.

While polynomial fitting is a well-known application of linear regression, the introduction of concepts such as overfitting, regularization, validation etc. greatly clarified how to better use these tools.

As these are iterative approaches, they require repeated evaluation of the fitting weights; thus any way to make this calculation faster can greatly impact the overall efficiency of the process.

As Lagrange polynomials provide an efficient basis in which to perform these calculations, this approach can find wide application.

Let’s set up an example context: suppose we have a large number of samples of the form (x, y), where x \in R is the independent variable (or feature) and y \in R is the dependent variable (or target).

We may want to partition the range spanned by x into n compact, connected, semiclosed and bounded subsets \sigma_i, i \in 1..n, e.g.

\sigma_i = (\delta (i-1) + min(x), \delta i + min(x)]

\delta = \frac{max(x) - min(x)}{n}

In each \sigma_i we can identify a point \bar{x}_i which will be used to define the Lagrange basis. There is no constraint on which point to choose, but it is reasonable to select one roughly “in the middle” of the range (or, if you prefer, the one which minimizes the integral of the squared distances to all the others):

\bar{x_i} = argmin_{x \in \sigma_i}\int_{y \in \sigma_i}{(x-y)^2}dy

In our example that would be

\bar{x_i} = min(x) + \frac{2i-1}{2} \delta

The starting value in the i-th segment for the iteration would then be the mean of all the y_j whose x_j falls in \sigma_i (where j is the sample id):

\bar{y_i} = E_{x_j \in \sigma_i}[y_j]
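The segmentation, the midpoints \bar{x}_i and the initial values \bar{y}_i above can be sketched as follows (the function names are illustrative, not from the post's repository; note the binning below puts the left endpoint into the first segment, a slight simplification of the semiclosed intervals):

```rust
/// Midpoints of n equal segments covering [min, max].
fn segment_midpoints(min: f64, max: f64, n: usize) -> Vec<f64> {
    let delta = (max - min) / n as f64;
    (1..=n)
        .map(|i| min + (2.0 * i as f64 - 1.0) / 2.0 * delta)
        .collect()
}

/// Mean of the y values whose x falls in each segment (NaN for empty segments).
fn segment_means(xs: &[f64], ys: &[f64], min: f64, max: f64, n: usize) -> Vec<f64> {
    let delta = (max - min) / n as f64;
    let mut sums = vec![0.0; n];
    let mut counts = vec![0usize; n];
    for (&x, &y) in xs.iter().zip(ys) {
        // clamp so that x == max falls into the last segment
        let i = (((x - min) / delta) as usize).min(n - 1);
        sums[i] += y;
        counts[i] += 1;
    }
    sums.iter().zip(&counts).map(|(s, &c)| *s / c as f64).collect()
}

fn main() {
    let xs = [0.1, 0.2, 1.5];
    let ys = [1.0, 3.0, 5.0];
    println!("midpoints: {:?}", segment_midpoints(0.0, 2.0, 2));
    println!("means: {:?}", segment_means(&xs, &ys, 0.0, 2.0, 2));
}
```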

The error evaluation would then be

\epsilon = \sum_{j}{err(y_j-P[x_j])}

where the function err(r) would be our error metric (r is a residual).

Let’s assume

err(r) = r^2

and let’s also assume the Lagrange polynomial starts with the initial values

P[x] = \sum_{i}\bar{y}_i l_i[x]

At each iteration we can calculate

\frac{d}{d\bar{y}_i}\epsilon = 2 \sum_{j \in samples}{l_i[x_j](P[x_j]-y_j)}

which gives us the gradient we need to take the next iteration step.
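A single descent step over the coefficients can be sketched as below; the squared-error gradient with respect to each coefficient b_i is 2 \sum_j l_i[x_j](P[x_j] - y_j). The `basis` matrix of precomputed l_i[x_j] values and the `gradient_step` name are assumptions of mine, not part of the post's repository:

```rust
/// One gradient-descent step on the coefficients b_i of P[x] = sum_i b_i l_i[x].
/// basis[j][i] holds l_i[x_j] precomputed for every sample j; lr is the step size.
fn gradient_step(b: &mut [f64], basis: &[Vec<f64>], y: &[f64], lr: f64) {
    let n = b.len();
    let mut grad = vec![0.0; n];
    for (j, row) in basis.iter().enumerate() {
        // residual P[x_j] - y_j
        let p: f64 = row.iter().zip(b.iter()).map(|(l, bi)| l * bi).sum();
        let r = p - y[j];
        for i in 0..n {
            grad[i] += 2.0 * row[i] * r;
        }
    }
    for i in 0..n {
        b[i] -= lr * grad[i];
    }
}

fn main() {
    // toy case: samples sitting exactly on the nodes, so the basis matrix is the identity
    let basis = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let mut b = vec![0.0, 0.0];
    for _ in 0..60 {
        gradient_step(&mut b, &basis, &[1.0, 3.0], 0.25);
    }
    println!("converged coefficients: {:?}", b);
}
```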

F statistics to identify chaotic behavior vs ordered behavior

We then need a metric to figure out whether we have a very large cloud of points in each area (which may indicate chaotic behavior) or whether the dependency looks quite good.

One possible approach is to compare whether the variation of the response values between sections (segments of the input feature) is comparable with the variation within each section.

I have been inspired by the F statistic used in ANOVA: we are comparing the mean of the within-section variations with the variation of the section means.

\chi = \frac{E[Var[Y_i]]}{Var[E[Y_i]]}

In an ordered behavior we expect the numerator to be quite a lot smaller than the denominator.
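A minimal sketch of this statistic, assuming the samples are already grouped per section (the helper names are mine, for illustration):

```rust
/// Arithmetic mean of a slice.
fn mean(v: &[f64]) -> f64 {
    v.iter().sum::<f64>() / v.len() as f64
}

/// Population variance of a slice.
fn variance(v: &[f64]) -> f64 {
    let m = mean(v);
    v.iter().map(|x| (x - m).powi(2)).sum::<f64>() / v.len() as f64
}

/// chi = E[Var[Y_i]] / Var[E[Y_i]] over the per-section samples.
fn chi(sections: &[Vec<f64>]) -> f64 {
    let within: Vec<f64> = sections.iter().map(|s| variance(s)).collect();
    let between: Vec<f64> = sections.iter().map(|s| mean(s)).collect();
    mean(&within) / variance(&between)
}

fn main() {
    // tight clusters with well-separated means: ordered behavior, small chi
    let sections = vec![vec![1.0, 1.1, 0.9], vec![5.0, 5.1, 4.9]];
    println!("chi = {}", chi(&sections));
}
```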

An efficient Rust implementation of Lagrange Polynomials

Weights calculation

Pre-calculating barycentric weights reduces the overall computation time.

As these depend only on the nodes, they can be computed upfront and stored in a dedicated data structure:

w_i = \prod_{1 \leq j \leq n, j \neq i}\frac{1}{(x_i-x_j)}

This is a possible implementation in Rust:

pub struct Baricentric {
    points: Vec<f64>,
    weights: Vec<f64>,
}

impl Baricentric {
    pub fn new(points: &[f64]) -> Baricentric {
        // w_i = prod_{j != i} 1 / (x_i - x_j)
        let weights: Vec<f64> = points
            .iter()
            .enumerate()
            .map(|(i, x0)| {
                points
                    .iter()
                    .enumerate()
                    .filter(|(j, _)| *j != i)
                    .fold(1.0, |acc, (_, x1)| acc / (*x0 - *x1))
            })
            .collect();
        Baricentric {
            points: points.to_owned(),
            weights,
        }
    }
}
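For a quick sanity check of the weights formula outside the struct, the same computation can be written as a free function (an illustration of mine, not part of the post's crate):

```rust
/// Barycentric weights w_i = prod_{j != i} 1 / (x_i - x_j).
fn barycentric_weights(points: &[f64]) -> Vec<f64> {
    points
        .iter()
        .enumerate()
        .map(|(i, x0)| {
            points
                .iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .fold(1.0, |acc, (_, x1)| acc / (x0 - x1))
        })
        .collect()
}

fn main() {
    // for nodes 0, 1, 2 the weights are [1/2, -1, 1/2]
    println!("{:?}", barycentric_weights(&[0.0, 1.0, 2.0]));
}
```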

Polar representation of Polynomials

It is possible to represent these polynomials by listing only the pole locations and an overall weight:

pub struct PolarPoly{
    poles: Vec<f64>,
    weight: f64
}

Then we may create them by either:

  1. passing these values directly
  2. calculating them from a Baricentric object: as this contains the information for all of the basis elements, we may want to get each one individually; this calls for an API similar to the slice get method, which returns an Option
impl PolarPoly {
    pub fn new(poles: &[f64], weight: f64) -> PolarPoly {
        PolarPoly {
            poles: poles.to_owned(),
            weight,
        }
    }
    pub fn from_baricentric(bar: &Baricentric, i: usize) -> Option<PolarPoly> {
        // None when i is out of range
        let weight = *(bar.weights.get(i)?);
        // the poles are all the nodes except the i-th one
        let poles = bar
            .points
            .iter()
            .enumerate()
            .filter(|(j, _)| *j != i)
            .map(|(_, x)| *x)
            .collect();
        Some(PolarPoly { poles, weight })
    }
}

Note that each polar polynomial will have a degree equal to

deg[l_i[x]] = len(nodes) -1

The other operations we may want are:

  1. evaluating the polynomial at a given point of its domain
  2. transforming it into a regular polynomial, to acquire the whole algebraic operation set and thus calculate a linear combination of the Lagrange basis; this requires some more work, explained in the next section
impl PolarPoly {
    pub fn eval(&self, x: f64) -> f64 {
        // weight * prod_i (x - p_i)
        self.weight * self.poles
            .iter()
            .map(|p| x - p)
            .fold(1.0, |acc, val| val * acc)
    }
    pub fn to_poly(&self) -> poly::Poly<f64> {
        todo!("implement this stuff");
    }
}

Here are two examples of output obtained by plotting the basis generated for four poles.

In the first example the poles are not equispaced: they are placed at 1, 2, 5 and 5.5. From the grid you may notice how each element of the basis has value 1 at exactly one pole and 0 at all the others.

plot_base.png

In the second example the grid is equispaced, with poles at 1, 2, 3 and 4.

plot_equispaced.png

Connecting with regular Polynomials in Rust

We need an intermediate data structure to efficiently calculate the coefficients.

The idea is to recursively multiply together all of the degree-1 factors; each multiplication takes the current intermediate result times x (which amounts to increasing the power of each monomial) and subtracts the same result times the pole value:

P_{i+1}[x] = (x - p_{i+1})P_i[x]

P_{i+1}[x] = x P_i[x] - p_{i+1} P_i[x]

By analogy with the binary operation, I call the first part “shift”:

P_{i+1}[x] = shift[P_i[x]] - p_{i+1} P_i[x]

This iterative sequence can be carried out in a fixed-size data structure; unfortunately the size is known only at runtime, so we need to use heap memory.

Iterating over all the poles, each iteration returns an intermediate result whose degree is increased by one: this can be simulated without moving any data, just by moving the position of the 0th-degree coefficient backward in the array.

We can start with a constant polynomial equal to 1; this is equivalent to an array of length 1 with just this value: [1]

in the first iteration we will get exactly the first polynomial

P_{1}[x] = 1 (x - p_1) = x - p_1

this is equivalent to an array like the following [-p_1, 1]

in the second iteration we have

P_{2}[x] = (x - p_1)(x - p_2) = x^2 + ((-p_1) + (-p_2)) x + (-p_1) (-p_2)

this is equivalent to an array like the following [(-p_1)(-p_2),-p_1-p_2, 1] etc.

The last element (the highest-degree coefficient) is always equal to 1; the others are obtained by subtracting from the current value (which is 0 for the newly included lowest cell) the product of the next cell and the pole.

Once all the multiplications have been performed we can multiply all the coefficients by the weight.

Now our code looks like this:

    pub fn to_poly(&self) -> poly::Poly<f64> {
        // the polynomial degree is equal to the number of poles,
        // thus the number of coefficients is the number of poles + 1
        let degree = self.poles.len();
        let mut coeff: Vec<f64> = vec![0.0; 1 + degree];

        // the highest coefficient is always equal to 1
        coeff[degree] = 1.0;

        // after each multiplication the degree is increased by one;
        // this can be simulated by counting each iteration backward
        for (id, pole) in self.poles.iter().enumerate() {
            for j in (degree - id - 1)..degree {
                coeff[j] = coeff[j] - coeff[j + 1] * pole;
            }
        }

        // finally multiply by the weight and return
        poly::Poly::new(coeff.iter().map(|x| self.weight * (*x)).collect())
    }
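The recursion can also be checked standalone. The `expand_poles` free function below mirrors the loop above; it is an illustration of mine, not part of the post's crate:

```rust
/// Expand weight * prod_i (x - p_i) into coefficients [c_0, ..., c_n].
fn expand_poles(poles: &[f64], weight: f64) -> Vec<f64> {
    let degree = poles.len();
    let mut coeff = vec![0.0; degree + 1];
    coeff[degree] = 1.0;
    // same backward-counting recursion as in to_poly
    for (id, pole) in poles.iter().enumerate() {
        for j in (degree - id - 1)..degree {
            coeff[j] = coeff[j] - coeff[j + 1] * pole;
        }
    }
    coeff.iter().map(|c| weight * c).collect()
}

fn main() {
    // (x - 1)(x - 2) = x^2 - 3x + 2  ->  [2, -3, 1]
    println!("{:?}", expand_poles(&[1.0, 2.0], 1.0));
}
```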

Conclusions

Lagrange polynomials are quite an old tool but are still interesting in modern applications. I mentioned machine learning, but another important field is Reed-Solomon error correction codes.

Developing some base structure in Rust has been an interesting journey into the language and optimization.

If you read this far, well I’m impressed! Let me know your comments.

marco.p.v.vezzoli

Self taught assembler programming at 11 on my C64 (1983). Never stopped since then -- always looking up for curious things in the software development, data science and AI. Linux and FOSS user since 1994. MSc in physics in 1996. Working in large semiconductor companies since 1997 (STM, Micron) developing analytics and full stack web infrastructures, microservices, ML solutions
