Noisy regression

Real-world data is noisy. This data set was measured during an experiment.

If we are expecting the result to be a parabola of the form \(y = a\cdot x^2 + b\cdot x + c\), find the \(a\), \(b\), and \(c\) that minimize the mean-squared error.

Enter the answer as \(a + b + c\).


Note: There are tools, such as Origin, that do this job automatically, but we are here to learn, and it is a good exercise to implement this at least once to understand how those tools work.


The answer is 14.021535716869955.


2 solutions

Mark Hennings
Nov 23, 2018

If we want to minimize
\[ F(a,b,c) \; = \; \sum_{j=1}^n \big(y_j - ax_j^2 - bx_j - c\big)^2 \]
then we require
\[ \frac{\partial F}{\partial a} \; = \; \frac{\partial F}{\partial b} \; = \; \frac{\partial F}{\partial c} \; = \; 0 \]
and hence
\[ \left(\begin{array}{ccc} E[x^4] & E[x^3] & E[x^2] \\ E[x^3] & E[x^2] & E[x] \\ E[x^2] & E[x] & 1 \end{array}\right)\left(\begin{array}{c} a \\ b \\ c \end{array}\right) \; = \; \left(\begin{array}{c} E[x^2y] \\ E[xy] \\ E[y] \end{array}\right) \]
where
\[ E[x^m] \; = \; \frac{1}{n}\sum_{j=1}^n x_j^m \hspace{2cm} 1 \le m \le 4 \]
is the average value of \(x^m\) for the data, and
\[ E[x^m y] \; = \; \frac{1}{n}\sum_{j=1}^n x_j^m y_j \hspace{2cm} 0 \le m \le 2 \]
is the average value of \(x^m y\) for the data (note that \(n\) is the number of data points; in this case, \(n = 10000\)).

Calculating these expected values can be done easily in Excel, and simple Gaussian elimination solves the equation, giving \(a = 1.003089322\), \(b = 3.132961855\), \(c = 9.88548454\), and hence \(a+b+c = \boxed{14.02153572}\).
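The same normal-equations calculation can be sketched in Python. Since the problem's data file is not reproduced here, the snippet below generates hypothetical noisy data around assumed coefficients close to the solution's fitted values, builds the moment matrix of averages \(E[x^m]\), and solves the linear system:

```python
import numpy as np

# Hypothetical noisy data (stand-in for the problem's data set),
# generated around assumed true coefficients a=1.0, b=3.1, c=9.9
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 10000)
y = 1.0 * x**2 + 3.1 * x + 9.9 + rng.normal(0, 1, x.size)

E = lambda v: v.mean()  # empirical expectation over the data

# Moment matrix and right-hand side from the normal equations
M = np.array([[E(x**4), E(x**3), E(x**2)],
              [E(x**3), E(x**2), E(x)],
              [E(x**2), E(x),    1.0]])
rhs = np.array([E(x**2 * y), E(x * y), E(y)])

a, b, c = np.linalg.solve(M, rhs)
print(a, b, c)  # should land close to the generating coefficients
```

With \(n = 10000\) points, the recovered coefficients sit very close to the generating ones, mirroring how the solution's Excel computation recovers the problem's parameters.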

Hello sir, very nice solution, though in the last equation, shouldn't the letter above the sigma be \(n\) instead of \(m\)?

Mehdi K. - 2 years, 1 month ago
João Areias
Nov 22, 2018

It is well known that, given 3 points, we can define a parabola with the following matrix equation:

\[ \left[\begin{matrix} y_0 \\ y_1 \\ y_2 \end{matrix}\right] = \left[\begin{matrix} x_0^2 & x_0 & 1 \\ x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \end{matrix}\right] \cdot \left[\begin{matrix} a \\ b \\ c \end{matrix}\right] \]

or, more commonly, written as \(A\vec x = \vec b\) where

\[ A = \left[\begin{matrix} x_0^2 & x_0 & 1 \\ x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \end{matrix}\right] \text{, } \quad \vec x = \left[\begin{matrix} a \\ b \\ c \end{matrix}\right] \text{, } \quad \vec b = \left[\begin{matrix} y_0 \\ y_1 \\ y_2 \end{matrix}\right] \]

and we can calculate \(\vec x\) by computing \(\vec x = A^{-1} \vec b\).
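As a minimal sketch of this exact three-point case (the points here are hypothetical, chosen to lie on an assumed parabola \(y = 2x^2 - 3x + 1\)):

```python
import numpy as np

# Three hypothetical points on the assumed parabola y = 2x^2 - 3x + 1
xs = np.array([0.0, 1.0, 2.0])
ys = 2 * xs**2 - 3 * xs + 1

# Matrix A with columns x^2, x, 1
A = np.vstack([xs**2, xs, np.ones_like(xs)]).T

# A is square and invertible for distinct x, so A x = b has an exact solution
coeffs = np.linalg.solve(A, ys)
print(coeffs)  # → [ 2. -3.  1.]
```

Solving the square system recovers \(a\), \(b\), and \(c\) exactly, which is precisely what breaks down once there are more than three noisy points.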

Well, we have far more than 3 points here and, because the data is noisy, it is unlikely that any two triplets will fit the same parabola. If we include all the points, our matrix \(A\) is no longer square, so we can't invert it. It turns out that there is a notion of a pseudo-inverse that can be applied to an \(m \times n\) matrix; defined by \((A^T A)^{-1} A^T\), it has the nice property of minimizing the squared error, so we can compute \(\vec x = (A^T A)^{-1} A^T \vec b\). Here is my Python implementation, with a plot of the data and the regression curve:

from numpy.linalg import inv
import matplotlib.pyplot as plt
import numpy as np

# Load the two-column data file: x in the first column, y in the second
x, y = np.loadtxt("data.txt").T

# Design matrix with columns x^2, x, 1
A = np.array([x**2, x, x**0]).T

# Pseudo-inverse (A^T A)^{-1} A^T, applied to y to get [a, b, c]
pseudo_inv = np.matmul(inv(np.matmul(A.T, A)), A.T)
regres = np.matmul(pseudo_inv, y)

f = np.poly1d(regres)
_y = f(x)

print(regres)
print(sum(regres))

plt.plot(x, y, 'ro', markersize=1)
plt.plot(x, _y, 'b-', linewidth=1)
plt.show()

It is worth noting that, with your matrices
\[ A \; = \; \left(\begin{array}{ccc} x_1^2 & x_1 & 1 \\ x_2^2 & x_2 & 1 \\ \cdots & \cdots & \cdots \\ x_n^2 & x_n & 1 \end{array}\right) \hspace{2cm} b \; = \; \left(\begin{array}{c} y_1 \\ y_2 \\ \cdots \\ y_n \end{array}\right) \]
then
\[ \tfrac{1}{n}A^TA \; = \; \left(\begin{array}{ccc} E[x^4] & E[x^3] & E[x^2] \\ E[x^3] & E[x^2] & E[x] \\ E[x^2] & E[x] & 1 \end{array}\right) \hspace{2cm} \tfrac{1}{n}A^Tb \; = \; \left(\begin{array}{c} E[x^2y] \\ E[xy] \\ E[y] \end{array}\right) \]
and so your method is in fact the same as mine. This explains why your method works.
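This equivalence is easy to verify numerically: dividing \(A^TA\) and \(A^Tb\) by \(n\) reproduces the moment matrix and right-hand side of the normal equations exactly (hypothetical data below, as a stand-in for the real set):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, 1000)   # hypothetical data points
y = x**2 + 3 * x + 10 + rng.normal(0, 1, x.size)
n = x.size

A = np.vstack([x**2, x, np.ones_like(x)]).T

E = lambda v: v.mean()  # empirical average over the data
moments = np.array([[E(x**4), E(x**3), E(x**2)],
                    [E(x**3), E(x**2), E(x)],
                    [E(x**2), E(x),    1.0]])
rhs = np.array([E(x**2 * y), E(x * y), E(y)])

# (1/n) A^T A is the moment matrix; (1/n) A^T b is the right-hand side
print(np.allclose(A.T @ A / n, moments))  # → True
print(np.allclose(A.T @ y / n, rhs))      # → True
```

The factor of \(n\) cancels on both sides of the normal equations, which is why both solutions produce identical coefficients.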

Mark Hennings - 2 years, 6 months ago

Log in to reply

Pretty much. I loved your approach to it, but yeah, they are in fact equivalent.

João Areias - 2 years, 6 months ago
