Our first supervised machine learning algorithm to walk through is linear regression. Linear regression is a linear approach to modelling the relationship between input variables and an output variable. Simple linear regression is the case with only one input variable, while multiple linear regression is the model based on more than one input variable.

In this article, we will first introduce the mathematics behind linear regression, followed by complete Python code that applies linear regression to a real-world example.

Mathematics

Imagine we have a set of data with $N$ observations, each with $M$ dimensions, and each observation is associated with one output.

Input: $\lbrace x_1, x_2, ..., x_N \rbrace$, where each $x$ has $M$ dimensions

Output: $\lbrace y_1, y_2, ..., y_N \rbrace$

In linear regression, we use a linear function that allows us to predict $y$ from the $M$ dimensions of $x$:

$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_Mx_M$

If we add a bias term $x_0 = 1$ to every observation and stack the observations into a matrix $X$, the model can be written compactly as:

$\hat{Y} = X\beta$
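
As a small illustration of this matrix form (a sketch using NumPy and a made-up feature matrix, not the article's real-world example), adding the bias term amounts to prepending a column of ones to $X$:

```python
import numpy as np

# Hypothetical data: N = 4 observations, M = 2 dimensions each
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# Prepend the bias term x_0 = 1 to every observation,
# so X has shape (N, M + 1) and beta has M + 1 entries
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

beta = np.array([0.5, 2.0, -1.0])  # example coefficients [beta_0, beta_1, beta_2]
Y_hat = X @ beta                   # one prediction per observation
```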

To find the best fit to the output, we minimize the difference between the prediction $\hat{Y}$ and the actual $Y$. Normally we use the least-squares approach to fit the model, although the model can be fitted in other ways.

Squared error $SE = \lVert Y - X\beta \rVert ^ 2$

We differentiate the error with respect to $\beta$, set the derivative equal to 0 (a stationary point, which is the minimum here because the squared error is convex in $\beta$), and solve for $\beta$.

$\dfrac{\partial}{\partial \beta}SE = \dfrac{\partial}{\partial \beta}(Y-X\beta)^T(Y-X\beta)$

$=\dfrac{\partial}{\partial \beta}(Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta)$

$=-2X^TY + 2X^TX\beta$

$-2X^TY + 2X^TX\beta = 0$

$X^TX\beta = X^TY$

$\beta = (X^TX)^{-1}X^TY$
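
As a sketch of how this normal equation looks in NumPy (on synthetic data, not the real-world example below), note that in practice we solve the linear system $X^TX\beta = X^TY$ with `np.linalg.solve` rather than forming the explicit inverse, and that the formula assumes $X^TX$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N = 100 observations, M = 2 dimensions plus a bias column
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
true_beta = np.array([1.0, 3.0, -2.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Normal equation: solve (X^T X) beta = X^T Y instead of inverting X^T X
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Squared error of the fit
SE = np.sum((Y - X @ beta) ** 2)
print(beta, SE)
```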

The last equation, the normal equation, gives the optimal $\beta$ for predicting the outputs, provided $X^TX$ is invertible. We will now illustrate how to apply it to a real-world dataset.

Example