How to implement a linear regression algorithm?
I need to implement a linear regression algorithm. Preferably it gives results equal to or close to the function trend (or TREND) of Excel.
I've found a lot of material that tries to explain the whole concept of linear regression, but that's not what I want. I would like a pseudo code that is directly translatable to an imperative programming language.
For those who did not use the function trend it works like this:
Given two vectors represent the x and the y in an ordered pair, (x,y) to that of the well-known values, for example, [ (1,10), (2,20), (3,30), (4,40) ], a value of x for which you want to find out the value of y corresponding to a function of TREND fits and the known values in the function, of the form y=mx+b, , and of the value of y. in This case, if I pass the value of 5 as the last argument to the function, I return 50.
Is anyone able to give a pseudo code of an algorithm that does this?
7 answers
I put together a small example in Java based on some examples I found on the Web.
I used the data you provided...
public class Main {
public static void main(String[] args) {
double[] x = {1, 2, 3, 4};
double[] y = {10, 20, 30, 40};
System.out.println(trend(y, x, 5));
}
public static double trend(double[] known_y, double[] known_x, double new_x)
{
double[] values = LeastSquaresFitLinear(known_y, known_x);
return (values[0] * new_x) + values[1];
}
public static double[] LeastSquaresFitLinear(double[] known_y, double[] known_x)
{
double M, B;
if (known_y.length != known_x.length)
{
return new double[]{0,0};
}
int numPoints = known_y.length;
double x1, y1, xy, x2, J;
x1 = y1 = xy = x2 = 0.0;
for (int i = 0; i < numPoints; i++)
{
x1 = x1 + known_x[i];
y1 = y1 + known_y[i];
xy = xy + known_x[i] * known_y[i];
x2 = x2 + known_x[i] * known_x[i];
}
M = B = 0;
J = ((double)numPoints * x2) - (x1 * x1);
if (J != 0.0)
{
M = (((double)numPoints * xy) - (x1 * y1)) / J;
B = ((y1 * x2) - (x1 * xy)) / J;
}
return new double[]{M,B};
}
}
The answer from @Marcos Banik can be translated for example to Python with minimal effort:
>>> x = [1,2,3,4]
>>> y = [10,20,30,40]
>>> m = sum(a*b for (a,b) in zip(x,y)) - sum(x) * sum(y) / len(x)
>>> m /= sum(a**2 for a in x) - (sum(x)**2)/len(x)
>>> b = sum(y)/len(y) - m * sum(x)/len(x)
>>> m*5 + b
50
However, not every imperative language has the same level of expressiveness, so I will present a more elaborate version (in JavaScript):
var x = [1,2,3,4];
var y = [10,20,30,40];
function produto(x, y) {
var ret = [];
for ( var i = 0 ; i < x.length ; i++ )
ret.push(x[i] * y[i]);
return ret;
}
function quadrados(x) {
var ret = [];
for ( var i = 0 ; i < x.length ; i++ )
ret.push(x[i] * x[i]);
return ret;
}
function somatorio(x) {
var ret = 0;
for ( var i = 0 ; i < x.length ; i++ )
ret += x[i];
return ret;
}
function media(x) { return somatorio(x) / x.length; }
var m = somatorio(produto(x,y)) - somatorio(x) * somatorio(y) / x.length;
m /= somatorio(quadrados(x)) - somatorio(x)*somatorio(x) / x.length;
var b = media(y) - m * media(x);
console.log(m*5 + b);
I don't know what algorithm the function TENDENCIA
uses, but if it's just an input variable x
and an output variable y
, the answer to m
and b
by Least Squares methods is: (code in R)
m <- (sum(x*y) - sum(x) * sum(y) / n) / ( sum(x^2) - ( sum(x) )^2 ) / n )
b <- mean(y) - m * mean(x)
n
is the number of elements of the vectors x
and y
,
sum
is the function that returns the sum of the vectors,
x * y
is the vector with the product of the elements of the same index.
x^2
is the vector where all its elements are squared
You you can find this solution at wikipedia
Using the same algorithm as @Marcos Banik's answer , I wrote a small algorithm in C based on the least squares method:
void lms(double *x, double *y, int n, double *m, double *b)
{
int i;
double sumYX = 0.;
double sumX = 0.;
double sumY = 0.;
double sumX2 = 0.;
double sum2X = 0.;
for(i = 0; i < n; i++) {
sumYX += x[i] * y[i];
sumX += x[i];
sumY += y[i];
sumX2 += x[i] * x[i];
}
sum2X = sumX * sumX;
*m = (sumYX - (sumX * sumY) / (double)n) / (sumX2 - sum2X / (double)n);
*b = sumY / (double)n - *m * sumX / (double)n;
}
The algorithm is exactly the same used in his answer. The initial for
loop is made to calculate all the summations and later its results are used to find m
and b
. I believe this is easily translatable to your preferred language. :)
With m
and b
you have the equation of the line. To calculate what the value of y is, just make a function:
double trend(double m, double b, double x)
{
return m*x + b;
}
To call functions:
int main(void)
{
double x[] = {1., 2., 3., 4.};
double y[] = {10., 20., 30., 40.};
double m, b;
double nx = 5.;
double ny;
lms(x, y, 4, &m, &b);
ny = trend(m, b, nx);
printf("m: %lf \nb: %lf \nnx: %lf \nny: %lf\n", m, b, nx, ny);
return 0;
}
For anyone who may be interested here is the implementation I was looking for.
It is written in the Clipper" language " (X/Harbour compiler) and works equal to trend for several values I tested.
Function Main()
LOCAL x := { 1,2,3,4 }
LOCAL y := { 10,20,30,40 }
? Trend(y, x, 5)
Return NIL
Function Trend( known_y, known_x, new_x )
LOCAL v := LSFL(known_y, known_x)
Return (v[1] * new_x) + v[2]
Function LSFL(known_y, known_x)
LOCAL M,B
LOCAL n
LOCAL x1, y1, xy, x2, J
LOCAL i
IF Len( known_y ) != Len( known_x )
Return { 0, 0 }
ENDIF
n := Len( known_y )
x1 := 0; y1 := 0; xy := 0; x2 := 0
For i := 1 To n
x1 := x1 + known_x[i]
y1 := y1 + known_y[i]
xy := xy + known_x[i] * known_y[i]
x2 := x2 + known_x[i] * known_x[i]
Next
M := 0
B := 0
J := ( n * x2 ) - ( x1 * x1 )
IF !Empty( J )
M := (( n * xy ) - ( x1 * y1 )) / J
B := ( (y1 * x2) - ( x1 * xy )) / J
ENDIF
Return { M, B }
You need y=mx+b
and for this you just need the data x
. y
was used only to discover other variables, but not to define the regression equation, correlation coefficient, mean, median and fashion of the data.
Generic scope: "what year (s) Will absolute population growth be zero?"
First we need to find out the growth rate ( % ) and whether it is positive or negative, what defines TREND or the slope of the graph generated by ER.
Based on existing data (anos/x
) we define the correlation coefficient or accuracy of responses for new y
.
Halfway we define fashion, median, average etc to support the solution of the problem that allows to work any matrix of x:y
depending on the TREND, the sign of bx
ormx
+
or -
.
y = a + bx + c ou y=mx+b
The complete code that generates the regression equation and correlation coefficient is in the state of Art by James Holmes in the Chapter 8 sources .
Follows an algorithm I made to calculate the MMQ based on a video lesson I watched on Youtube:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
void main(){
float x[] = {-1,0,1,2};
float y[] = {-1,1,3,5};
float sx;
float xy;
float totalx = 0;
float totaly = 0;
float totalsx = 0;
float totalxy = 0;
int n = sizeof(x)/sizeof(float);
int cont;
for(cont = 0;cont<n;cont++){
totalx = totalx + x[cont];
totaly = totaly + y[cont];
totalsx = totalsx + pow(x[cont],2);
totalxy = totalxy + (x[cont]*y[cont]);
}
// Passo1
float vbb = n;
float vba = totalx;
float somvba = totaly;
float b2p1 = totalx;
float a2p1 = totalsx;
float soma2p1 = totalxy;
//Passo 2
float b1p2 = vbb / vbb;
float a1p2 = vba / vbb;
float soma1p2 = somvba / vbb;
float b2p2 = b2p1-(b1p2*totalx);
float a2p2 = a2p1-(a1p2*totalx);
float soma2p2 = soma2p1-(soma1p2*totalx);
// Passo 3
float b1p3 = b1p2;
float a1p3 = a1p2;
float soma1p3 = soma1p2;
float a2p3 = a2p2 / a2p2;
float soma2p3 = soma2p2 / a2p2;
float afinal = soma2p3 / a2p3;
float bfinal = soma1p3 - a1p3 * afinal;
printf("\nValor final de A= %.5f e valor de B= %.5f\n\n", afinal, bfinal);
system("pause >> log");
}