How to implement a linear regression algorithm?

I need to implement a linear regression algorithm. Preferably it gives results equal to or close to the function trend (or TREND) of Excel.

I've found a lot of material that tries to explain the whole concept of linear regression, but that's not what I want. I would like a pseudo code that is directly translatable to an imperative programming language.

For those who did not use the function trend it works like this:

Given two vectors represent the x and the y in an ordered pair, (x,y) to that of the well-known values, for example, [ (1,10), (2,20), (3,30), (4,40) ], a value of x for which you want to find out the value of y corresponding to a function of TREND fits and the known values in the function, of the form y=mx+b, , and of the value of y. in This case, if I pass the value of 5 as the last argument to the function, I return 50.

Is anyone able to give a pseudo code of an algorithm that does this?

Author: Emerson Rocha, 2014-02-05

7 answers

I put together a small example in Java based on some examples I found on the Web.

I used the data you provided...

public class Main {

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {10, 20, 30, 40};
        System.out.println(trend(y, x, 5));
    }

    public static double trend(double[] known_y, double[] known_x, double new_x)
    {
        double[] values = LeastSquaresFitLinear(known_y, known_x);
        return (values[0] * new_x) + values[1];
    }

    public static double[] LeastSquaresFitLinear(double[] known_y, double[] known_x)
    {
        double M, B;
        if (known_y.length != known_x.length)
        {
            return new double[]{0,0};
        }

        int numPoints = known_y.length;

        double x1, y1, xy, x2, J;

        x1 = y1 = xy = x2 = 0.0;
        for (int i = 0; i < numPoints; i++)
        {
            x1 = x1 + known_x[i];
            y1 = y1 + known_y[i];
            xy = xy + known_x[i] * known_y[i];
            x2 = x2 + known_x[i] * known_x[i];
        }

        M = B = 0;
        J = ((double)numPoints * x2) - (x1 * x1);

        if (J != 0.0)
        {
            M = (((double)numPoints * xy) - (x1 * y1)) / J;
            B = ((y1 * x2) - (x1 * xy)) / J;
        }
        return new double[]{M,B};
    }

}
 8
Author: Marcelo Haßlocher, 2014-02-05 17:35:06

The answer from @Marcos Banik can be translated for example to Python with minimal effort:

>>> x = [1,2,3,4]
>>> y = [10,20,30,40]
>>> m = sum(a*b for (a,b) in zip(x,y)) - sum(x) * sum(y) / len(x)
>>> m /= sum(a**2 for a in x) - (sum(x)**2)/len(x)
>>> b = sum(y)/len(y) - m * sum(x)/len(x)
>>> m*5 + b
50

However, not every imperative language has the same level of expressiveness, so I will present a more elaborate version (in JavaScript):

var x = [1,2,3,4];
var y = [10,20,30,40];

function produto(x, y) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * y[i]);
    return ret;
}

function quadrados(x) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * x[i]);
    return ret;
}

function somatorio(x) {
    var ret = 0;
    for ( var i = 0 ; i < x.length ; i++ )
        ret += x[i];
    return ret;
}

function media(x) { return somatorio(x) / x.length; }

var m = somatorio(produto(x,y)) - somatorio(x) * somatorio(y) / x.length;
m /= somatorio(quadrados(x)) - somatorio(x)*somatorio(x) / x.length;

var b = media(y) - m * media(x);

console.log(m*5 + b);

Example in jsFiddle .

 9
Author: mgibsonbr, 2017-04-13 12:59:40

I don't know what algorithm the function TENDENCIA uses, but if it's just an input variable x and an output variable y, the answer to m and b by Least Squares methods is: (code in R)

m <- (sum(x*y) - sum(x) * sum(y) / n) / ( sum(x^2) - ( sum(x) )^2 ) / n )
b <- mean(y) - m * mean(x)

n is the number of elements of the vectors x and y, sum is the function that returns the sum of the vectors, x * y is the vector with the product of the elements of the same index. x^2 is the vector where all its elements are squared

You you can find this solution at wikipedia

 5
Author: Marcos Banik, 2014-02-05 17:32:33

Using the same algorithm as @Marcos Banik's answer , I wrote a small algorithm in C based on the least squares method:

void lms(double *x, double *y, int n, double *m, double *b)
{
    int i;
    double sumYX = 0.;
    double sumX = 0.;
    double sumY = 0.;
    double sumX2 = 0.;
    double sum2X = 0.;
    for(i = 0; i < n; i++) {
        sumYX += x[i] * y[i];
        sumX += x[i];
        sumY += y[i];
        sumX2 += x[i] * x[i];
    }
    sum2X = sumX * sumX;

    *m = (sumYX - (sumX * sumY) / (double)n) / (sumX2 - sum2X / (double)n);
    *b = sumY / (double)n - *m * sumX / (double)n;
}

The algorithm is exactly the same used in his answer. The initial for loop is made to calculate all the summations and later its results are used to find m and b. I believe this is easily translatable to your preferred language. :)

With m and b you have the equation of the line. To calculate what the value of y is, just make a function:

double trend(double m, double b, double x)
{
    return m*x + b;
}

To call functions:

int main(void) 
{
    double x[] = {1., 2., 3., 4.};
    double y[] = {10., 20., 30., 40.};
    double m, b;
    double nx = 5.;
    double ny;

    lms(x, y, 4, &m, &b);
    ny = trend(m, b, nx);

    printf("m: %lf \nb: %lf \nnx: %lf \nny: %lf\n", m, b, nx, ny);

    return 0;
}
 4
Author: user568459, 2017-04-13 12:59:42

For anyone who may be interested here is the implementation I was looking for.

It is written in the Clipper" language " (X/Harbour compiler) and works equal to trend for several values I tested.

Function Main()

   LOCAL x := { 1,2,3,4 }
   LOCAL y := { 10,20,30,40 }

   ? Trend(y, x, 5)

   Return NIL

Function Trend( known_y, known_x, new_x )

   LOCAL v := LSFL(known_y, known_x)

   Return (v[1] * new_x) + v[2]

Function LSFL(known_y, known_x)

   LOCAL M,B
   LOCAL n
   LOCAL x1, y1, xy, x2, J
   LOCAL i

   IF Len( known_y ) != Len( known_x )
      Return { 0, 0 }
   ENDIF

   n := Len( known_y )

   x1 := 0; y1 := 0; xy := 0; x2 := 0

   For i := 1 To n
      x1 := x1 + known_x[i]
      y1 := y1 + known_y[i]
      xy := xy + known_x[i] * known_y[i]
      x2 := x2 + known_x[i] * known_x[i]
   Next

   M := 0
   B := 0

   J := ( n * x2 ) - ( x1 * x1 )

   IF !Empty( J )
      M := (( n * xy ) - ( x1 * y1 )) / J
      B := ( (y1 * x2) - ( x1 * xy )) / J
   ENDIF

   Return { M, B }
 2
Author: Raul Almeida, 2014-02-05 18:17:57

You need y=mx+b and for this you just need the data x. y was used only to discover other variables, but not to define the regression equation, correlation coefficient, mean, median and fashion of the data.

Generic scope: "what year (s) Will absolute population growth be zero?"

First we need to find out the growth rate ( % ) and whether it is positive or negative, what defines TREND or the slope of the graph generated by ER.

Based on existing data (anos/x) we define the correlation coefficient or accuracy of responses for new y.

Halfway we define fashion, median, average etc to support the solution of the problem that allows to work any matrix of x:y depending on the TREND, the sign of bx ormx + or -.

y = a + bx + c ou y=mx+b

The complete code that generates the regression equation and correlation coefficient is in the state of Art by James Holmes in the Chapter 8 sources .

 2
Author: Zé Sobrinho, 2014-07-11 06:07:27

Follows an algorithm I made to calculate the MMQ based on a video lesson I watched on Youtube:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>


void main(){

    float x[] = {-1,0,1,2};
    float y[] = {-1,1,3,5};
    float sx;
    float xy;
    float totalx = 0;
    float totaly = 0;
    float totalsx = 0;
    float totalxy = 0;
    int n = sizeof(x)/sizeof(float);

    int cont;

    for(cont = 0;cont<n;cont++){
        totalx = totalx + x[cont];
        totaly = totaly + y[cont];
        totalsx = totalsx + pow(x[cont],2);
        totalxy = totalxy + (x[cont]*y[cont]);
    }

    // Passo1
    float vbb = n;
    float vba = totalx;
    float somvba = totaly;
    float b2p1 = totalx;
    float a2p1 = totalsx;
    float soma2p1 = totalxy;

    //Passo 2
    float b1p2 = vbb / vbb;
    float a1p2 = vba / vbb;
    float soma1p2 = somvba / vbb;
    float b2p2 = b2p1-(b1p2*totalx);
    float a2p2 = a2p1-(a1p2*totalx);
    float soma2p2 = soma2p1-(soma1p2*totalx);

    // Passo 3
    float b1p3 = b1p2;
    float a1p3 = a1p2;
    float soma1p3 = soma1p2;
    float a2p3 = a2p2 / a2p2;
    float soma2p3 = soma2p2 / a2p2;

    float afinal = soma2p3 / a2p3;
    float bfinal = soma1p3 - a1p3 * afinal;

    printf("\nValor final de A= %.5f e valor de B= %.5f\n\n", afinal, bfinal);
    system("pause >> log");
}
 1
Author: Robson Miranda De Souza, 2015-10-12 23:36:28