Bangda Sun

Practice makes perfect

Machine Learning Overview Series (2) - Linear Regression

Linear Regression (LR) in practice is used for both regression and classification (by applying a sigmoid); in statistics it is purely a regression model. Although it makes strong mathematical assumptions about the data, linear regression is still a powerful tool and can be applied to a wide range of regression tasks.

1. Introduction

I believe this is the first model most people encounter when they start with machine learning, and the space here will not be sufficient to illustrate its full range of applications or to explain why it is so widely used.

2. Mathematical Basis

Linear regression is a statistical model, more specifically, it is a parametric model - this means it has some assumptions (probability distributions) about the data it models.

2.1 Linear Regression

The form of linear regression model is
\[
y = w_{0} + w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{p}x_{p} + \epsilon,
\]
where we assume the relationship between the predictors (features) and the response (target) is linear; the \(w\)’s are called coefficients (parameters). If we have a dataset with \(n\) observations and \(p\) predictors, linear regression is built in this form:
\[
\begin{bmatrix}
y_{1}\\
y_{2}\\
\vdots\\
y_{n}
\end{bmatrix} =
\begin{bmatrix}
1&x_{11}&x_{12}&\cdots&x_{1p}\\
1&x_{21}&x_{22}&\cdots&x_{2p}\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
1&x_{n1}&x_{n2}&\cdots&x_{np}
\end{bmatrix}
\begin{bmatrix}
w_{0}\\
w_{1}\\
w_{2}\\
\vdots\\
w_{p}
\end{bmatrix} +
\begin{bmatrix}
\epsilon_{1}\\
\epsilon_{2}\\
\vdots\\
\epsilon_{n}
\end{bmatrix}\text{, or } \mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{\epsilon}.
\]

where \(\epsilon\) is known as the random error, or noise, which cannot be observed; it includes all factors that cannot be explained by the predictors.
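To make the notation concrete, here is a minimal NumPy sketch that simulates data from exactly this model; the sample size, true coefficients, and noise level below are arbitrary choices for illustration (later snippets reuse `X`, `y` and `w_true`):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                                 # observations and predictors (arbitrary)
X = np.hstack([np.ones((n, 1)),               # intercept column of ones
               rng.normal(size=(n, p))])      # p random features
w_true = np.array([2.0, 0.5, -1.0, 3.0])      # w_0, w_1, ..., w_p (made up)
sigma = 0.5                                   # noise standard deviation

epsilon = rng.normal(scale=sigma, size=n)     # epsilon ~ N(0, sigma^2)
y = X @ w_true + epsilon                      # y = Xw + epsilon
```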

2.2 Model Assumptions

The basic assumptions of linear regression include

  • \(\epsilon\) is i.i.d. Normal with zero mean and fixed variance \(\sigma^{2}\), i.e. \(\epsilon\sim N(0, \sigma^{2})\);
  • \(\epsilon\) is independent of \(\mathbf{X}\).

Based on the first assumption and the properties of the Normal distribution, we can derive that the target carries an implicit assumption: \(\mathbf{y}\sim N(\mathbf{X}\mathbf{w}, \sigma^{2}\mathbf{I})\).

2.3 Parameter Estimation

2.3.1 Ordinary Least Squares

As with any other supervised learning problem, to estimate the coefficients we need to minimize the loss function of the model. Usually we use squared loss since it has good properties (discussed later in this post); other loss functions, such as absolute loss, are sometimes used when we want the model to be less sensitive to outliers. If we choose squared loss, we are actually on the track of ordinary least squares (OLS), i.e. minimizing the RSS (Residual Sum of Squares),

\[
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum^{n}_{i=1}(y_{i} - w_{0}-w_{1}x_{i1}-\cdots-w_{p}x_{ip})^{2} = \arg\min_{\mathbf{w}} (\mathbf{y}-\mathbf{X}\mathbf{w})^\top(\mathbf{y}-\mathbf{X}\mathbf{w}),
\]

and we can solve it explicitly,

\[
\begin{aligned}
L(\mathbf{w}) =~& \left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)^\top\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right) \\
=~& \mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w},
\end{aligned}
\]

take the gradient and set it equal to 0,

\[
\begin{aligned}
&\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w} = 0 \\
\Longrightarrow~& \mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y} \\
\Longrightarrow~& \hat{\mathbf{w}} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}.
\end{aligned}
\]

Notice that this requires \(\mathbf{X}^\top\mathbf{X}\) to be invertible, i.e. \(\mathbf{X}\) must have full column rank (in particular \(n\geq p+1\) when the intercept column is included).
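Continuing the NumPy sketch above, the closed-form estimate can be computed directly; solving the normal equations (or calling a least-squares routine) is numerically preferable to forming the inverse explicitly:

```python
# Closed-form OLS estimate: solve (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalently, a least-squares solver is more robust when X^T X is ill-conditioned.
w_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```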

2.3.2 Maximum Likelihood Estimation

Think about why we use squared loss: why is it different from absolute loss? Let's set OLS aside for a moment; in statistics, maximum likelihood estimation (MLE) is always one of the answers.

We have already assumed that the error term is Normally distributed, i.e. \(\epsilon\sim N(0,\sigma^2)\), and that \(\mathbf{X}\) is fixed and independent of the error term; hence \(y_{i}\) also follows a Normal distribution, \(y_{i}\sim N(\mathbf{x}_{i}^\top\mathbf{w}, \sigma^{2})\). Since all observations are independent, we get
\[
\begin{aligned}
\hat{\mathbf{w}}_{mle} &= \arg\max_{\mathbf{w}}f(\mathbf{y}, \mathbf{X}) = \arg\max_{\mathbf{w}}\prod^{n}_{i=1}f(y_{i},\mathbf{x}_{i}) = \arg\max_{\mathbf{w}}\prod^{n}_{i=1}f(y_{i}|\mathbf{x}_{i})f(\mathbf{x}_{i}) \\
&= \arg\max_{\mathbf{w}}\prod^{n}_{i=1}\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y_{i}-\mathbf{x}_{i}^\top\mathbf{w})^{2}}{2\sigma^{2}}\right\} \\
&= \arg\max_{\mathbf{w}} -\sum^{n}_{i=1}(y_{i}-\mathbf{x}_{i}^\top\mathbf{w})^{2} \\
&= \arg\min_{\mathbf{w}} \sum^{n}_{i=1}(y_{i}-\mathbf{x}_{i}^\top\mathbf{w})^{2},
\end{aligned}
\]
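where the step from the product of Normal densities to the sum of squares follows from taking the logarithm (a monotone transformation) and dropping every term that does not depend on \(\mathbf{w}\), including the \(f(\mathbf{x}_{i})\) factors:

\[
\ell(\mathbf{w}) = \sum^{n}_{i=1}\log f(y_{i}|\mathbf{x}_{i}) = -\frac{n}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum^{n}_{i=1}(y_{i}-\mathbf{x}_{i}^\top\mathbf{w})^{2}.
\]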
See that? Under this assumption, \(\hat{\mathbf{w}}_{ols} = \hat{\mathbf{w}}_{mle}\). That is one perspective that explains why we prefer squared loss in linear regression, and keep in mind that it is based on our assumptions on \(\epsilon\).

2.3.3 Numerical Methods

Another commonly used method to estimate the parameters is to use a gradient descent-style algorithm to find the minimum of the loss function. The gradient of the squared loss is

\[
\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}
\]

therefore the update rule for \(\mathbf{w}\) is

\[
\mathbf{w}_{t+1} = \mathbf{w}_{t} - \alpha\nabla_{\mathbf{w}} L(\mathbf{w}_{t}) = \mathbf{w}_{t} - \alpha\left(-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}_{t}\right),
\]

where \(\alpha\) is the learning rate.
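As an illustration, here is a minimal sketch of this update in NumPy (reusing `X` and `y` from the simulation above; the learning rate and iteration count are arbitrary, and the gradient is scaled by \(1/n\), which does not change the minimizer but makes the learning rate easier to choose):

```python
def fit_gradient_descent(X, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent for the squared loss (minimal sketch).

    Assumes X already contains an intercept column of ones.
    """
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = (2.0 / n) * (X.T @ X @ w - X.T @ y)   # gradient of the mean squared error
        w = w - alpha * grad
    return w
```

On the simulated data, `fit_gradient_descent(X, y)` should end up very close to the closed-form `w_hat` above.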

3. Practical Issues

Although linear regression is a simple parametric model, its assumptions are often hard to satisfy in practice.

3.1 How good is the model?

There are multiple statistical tests that can be used to give probabilistic statements about the estimated parameters:

  • t-test: is a single coefficient \(w_{i}\) statistically significant? In other words, is \(x_{i}\) (feature \(i\)) useful in predicting \(y\)?
  • F-test: are multiple coefficients jointly statistically significant?

But usually people (non-statistics guys) just care about the RMSE on test data…
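If you do want those tests in Python, one quick option is statsmodels; a sketch reusing the simulated `X` and `y` (note that `X` already contains the intercept column, otherwise `sm.add_constant` would be needed):

```python
import statsmodels.api as sm

results = sm.OLS(y, X).fit()
print(results.summary())   # per-coefficient t-tests and the overall F-test
```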

3.2 Dependence in Noise

When the \(\epsilon\) are still drawn from the same distribution but are not independent, the results of the t-test and F-test are invalidated, since those tests require the sample to be i.i.d. Possible causes include:

  • time series data
  • duplicate data (e.g. double the data by accident)

3.3 Multi-Collinearity

This means some features are highly correlated. Consider the extreme case where two features are identical: \(\mathbf{X}^\top\mathbf{X}\) is then not full rank, so the solution \(\mathbf{w}\) is unstable, which can lead to counter-intuitive results when interpreting the model.
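A quick way to see this numerically (a sketch on the simulated `X` above) is to duplicate a column and inspect the rank and condition number of \(\mathbf{X}^\top\mathbf{X}\):

```python
X_dup = np.hstack([X, X[:, [1]]])      # append an exact copy of one feature
gram = X_dup.T @ X_dup

print(np.linalg.matrix_rank(gram), gram.shape[0])   # rank is smaller than the size
print(np.linalg.cond(gram))                         # condition number blows up
```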

3.4 Outliers

Since the loss function of linear regression is squared loss by “default”, it is more sensitive to extremely large values. Removing outliers or applying a log transformation to the target can make the model more robust.
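For example, fitting on a log scale might look like the sketch below, where `y_pos` stands for a hypothetical non-negative, right-skewed target (`np.log1p` is used so zeros are handled):

```python
# Hypothetical example: fit on the log scale, predict back on the original scale.
y_log = np.log1p(y_pos)                              # y_pos: non-negative target (hypothetical)
w_hat_log = np.linalg.solve(X.T @ X, X.T @ y_log)    # OLS on the transformed target
y_pred = np.expm1(X @ w_hat_log)                     # back-transform predictions
```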

4. Implementation

We can estimate the coefficients both explicitly and numerically. For numerical computing, we will use gradient descent again to find the minimum of the loss function; here is the implementation I did.

In R we have lm(), and in sklearn.linear_model we have LinearRegression().
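A minimal scikit-learn sketch (drop the column of ones, since `LinearRegression` fits its own intercept by default):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X[:, 1:], y)
print(model.intercept_, model.coef_)   # should be close to w_true from the simulation
```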