Bias and Variance of Models
The bias-variance trade-off is one of the most important ideas in machine learning, and every data scientist should keep it in mind. Suppose we already have results from several models on a specific problem; one question is how to compare those models.
Machine learning is often results-oriented: the algorithm that achieves the lower error is the better choice. The principle behind this is the bias-variance trade-off.
We know the error of an algorithm is measured by a loss function \(L(y, \hat{y})\), and in practice we usually learn the parameters by minimizing the empirical loss
\[
\mathbb{E}[L(y,\hat{y})]\approx\frac{1}{n}\sum^{n}_{i=1}L(y_{i}, \hat{y_{i}}).
\]
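As a quick illustration, here is how that empirical average looks in code; this is a minimal sketch of my own, and the function name `empirical_mse` is made up for this post rather than taken from any library:

```python
import numpy as np

def empirical_mse(y, y_hat):
    """Empirical squared loss: (1/n) * sum_i (y_i - y_hat_i)^2."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# Compare two models' predictions against the same labels.
y = np.array([1.0, 2.0, 3.0])
print(empirical_mse(y, [1.1, 1.9, 3.2]))  # model A
print(empirical_mse(y, [0.5, 2.5, 3.5]))  # model B
```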
Our next goal is to decompose this quantity. To make our life easier, I assume we are facing a regression problem and the loss function is squared loss, so we want to minimize the MSE (Mean Squared Error). Suppose we have fit a function \(\hat{y} = \hat{f}(x)\); for a new data point \((\tilde{x}, \tilde{y})\), we want the expected MSE to be as small as possible,
\[
\mathbb{E}\left[(\tilde{y} - \hat{f}(\tilde{x}))^2\right].
\]
Then we can apply a commonly used trick to decompose this: add and subtract \(\mathbb{E}[\hat{f}(\tilde{x})]\) inside the square. Assume \(\tilde{y} = f(\tilde{x}) + \epsilon\) with \(\mathbb{E}[\epsilon] = 0\) and \(\text{Var}(\epsilon) = \sigma^2\), where \(\epsilon\) is independent of \(\hat{f}\) (the new point was not used for training), so the cross term vanishes and
\[
\begin{aligned}
&~\mathbb{E}\left[(\tilde{y} - \hat{f}(\tilde{x}))^2\right] \\
=&~\mathbb{E}\left[(\tilde{y} - \mathbb{E}[\hat{f}(\tilde{x})] + \mathbb{E}[\hat{f}(\tilde{x})] - \hat{f}(\tilde{x}))^2\right] \\
=&~\mathbb{E}\left[(\tilde{y} - \mathbb{E}[\hat{f}(\tilde{x})])^2\right] + \mathbb{E}\left[(\hat{f}(\tilde{x}) - \mathbb{E}[\hat{f}(\tilde{x})])^2\right] \\
=&~\text{Bias}\left(\hat{f}(\tilde{x})\right)^2 + \text{Var}\left(\hat{f}(\tilde{x})\right) + \sigma^2,
\end{aligned}
\]
where \(\text{Bias}(\hat{f}(\tilde{x})) = f(\tilde{x}) - \mathbb{E}[\hat{f}(\tilde{x})]\) and \(\sigma^2\) is the irreducible error coming from the noise \(\epsilon\),
yeah, you can see we get the bias and variance terms successfully. You can also see that a good model does not just achieve low bias or low variance alone; it should achieve low total error, which consists of squared bias, variance, and irreducible noise, and that means keeping bias and variance low simultaneously.
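We can also check this decomposition with a small Monte-Carlo simulation. The sketch below is my own, not part of the derivation: it assumes a made-up truth \(f(x) = \sin(2\pi x)\), Gaussian noise, and a cubic polynomial as \(\hat{f}\), refits on many fresh training sets, and compares \(\text{bias}^2 + \text{variance} + \sigma^2\) against a direct estimate of the expected MSE at a fixed point \(\tilde{x}\):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The true (unknown) regression function used to simulate data."""
    return np.sin(2 * np.pi * x)

sigma = 0.3       # noise standard deviation, so Var(eps) = sigma**2
x_tilde = 0.5     # fixed test point
n, n_trials = 30, 2000
degree = 3        # flexibility of the fitted polynomial (an assumption)

preds = np.empty(n_trials)
for t in range(n_trials):
    # Draw a fresh training set each trial: y = f(x) + eps.
    x = rng.uniform(0.0, 1.0, n)
    y = f(x) + rng.normal(0.0, sigma, n)
    coef = np.polyfit(x, y, degree)        # fit hat{f} by least squares
    preds[t] = np.polyval(coef, x_tilde)   # hat{f}(x_tilde) for this trial

bias_sq = (preds.mean() - f(x_tilde)) ** 2   # squared bias at x_tilde
var = preds.var()                            # variance of the estimates
eps_new = rng.normal(0.0, sigma, n_trials)   # fresh noise for y_tilde
mse = np.mean((f(x_tilde) + eps_new - preds) ** 2)
print(f"bias^2 + var + sigma^2     = {bias_sq + var + sigma**2:.4f}")
print(f"Monte-Carlo E[(y-fhat)^2]  = {mse:.4f}")  # should roughly agree
```

The two printed numbers should agree up to Monte-Carlo error, which is exactly what the decomposition above predicts.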
Here are the informal definitions of bias and variance:
- Bias: error introduced by approximating a complicated model with a simpler one, or put differently, the deviation of the average prediction from the true value;
- Variance: the amount by which \(\hat{f}(\tilde{x})\) would change if we estimated it using a different training set, or simply, the variance of the estimates.
In general, more flexible models / algorithms have higher variance but lower bias, since they focus more on reducing bias; boosting is a typical example.
Theoretically, bias and variance do not increase or decrease at the same time: as flexibility grows, bias falls while variance rises. The total error is therefore not a constant value, and a minimal total error exists at some intermediate flexibility.
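To see this trade-off numerically, we can extend the simulation above and sweep the polynomial degree. Again this is a hedged sketch with made-up settings (the degrees, noise level, and test point are my choices), not a definitive experiment; squared bias should shrink and variance grow as the degree increases, with the total minimized somewhere in between:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # same made-up truth as before
sigma, n, n_trials, x_tilde = 0.3, 30, 1000, 0.5

for degree in (1, 3, 5, 9):           # increasing model flexibility
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n)
        y = f(x) + rng.normal(0.0, sigma, n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_tilde)
    bias_sq = (preds.mean() - f(x_tilde)) ** 2
    var = preds.var()
    print(f"degree={degree}: bias^2={bias_sq:.4f}, var={var:.4f}, "
          f"total={bias_sq + var + sigma**2:.4f}")
```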