In the world of Machine Learning, accuracy is everything. You strive to make your model more accurate by tuning and tweaking the parameters, but you are never able to make it 100% accurate. That’s the hard truth about prediction/classification models: they can never be error-free. In this article I’ll discuss why this happens, and which other forms of error can be reduced.
Suppose we are observing a response variable Y (qualitative or quantitative) and an input variable X with p features or columns (X1, X2, …, Xp), and we assume there is a relation between them. This relation can be expressed as
Y = f(X) + e
Here f is some fixed but unknown function of X1, …, Xp, and e is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y. Estimating this relation, or f(X), is known as statistical learning.
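As a quick illustration, here is a minimal sketch of this data-generating form in Python. The choice of f (a sine curve), the noise level, and the single feature are all made up for the example; in a real problem, f is exactly the thing we do not know.

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed "true" relation, standing in for the unknown f
def f(x):
    return np.sin(2 * np.pi * x)

n = 100
X = rng.uniform(0, 1, size=n)       # input variable (here p = 1 feature)
e = rng.normal(0, 0.3, size=n)      # random error term: independent of X, mean zero
Y = f(X) + e                        # observed response = systematic part + noise
```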
In general, we won’t be able to make a perfect estimate of f(X), and this gives rise to an error term known as the reducible error. The accuracy of the model can be improved by making a more accurate estimate of f(X) and thereby reducing the reducible error. But even if we made a 100% accurate estimate of f(X), our model would not be error-free: the error that remains is known as the irreducible error (e in the above equation).
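To see the distinction numerically, here is a rough sketch building on the toy setup above: we fit a deliberately imperfect model (a straight line) and compare its test MSE with the error we would get if we somehow knew the true f. Even the true f cannot do better than the noise variance, and that floor is the irreducible error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)    # assumed true relation (unknown in practice)
noise_sd = 0.3

# Draw training and test data from Y = f(X) + e
X_train = rng.uniform(0, 1, (200, 1))
y_train = f(X_train).ravel() + rng.normal(0, noise_sd, 200)
X_test = rng.uniform(0, 1, (5000, 1))
y_test = f(X_test).ravel() + rng.normal(0, noise_sd, 5000)

# An imperfect estimate of f: a straight line through curved data
fit = LinearRegression().fit(X_train, y_train)

mse_estimate = np.mean((y_test - fit.predict(X_test)) ** 2)  # reducible + irreducible
mse_true_f = np.mean((y_test - f(X_test).ravel()) ** 2)      # irreducible only, ~ noise_sd**2

print(mse_estimate, mse_true_f)  # the gap between the two is the reducible part
```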
In other terms, the irreducible error can be seen as information that X cannot provide about Y. The quantity e may contain unmeasured variables that are useful in predicting Y: since we don’t measure them, f cannot use them for its prediction. The quantity e may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient’s general feeling of well-being on that day.
Such cases are present in every problem, and the error they introduce is not reducible, since they are generally not represented in the training data. There is nothing we can do about it. What we can do is reduce the other forms of error and get a near-perfect estimate of f(X). But first, let’s take a look at some other important concepts in machine learning that you need to understand in order to proceed further.
Model Complexity
The complexity of the relation f(X) between the input and response variables is an important factor to consider while learning from a dataset. A simple relation is easy to interpret. For example, a linear model would look like this:
Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
It is easy to infer information from this relation, and it clearly tells you how a particular feature impacts the response variable. Such models come under the category of restrictive models, as they can take only a particular form, linear in this case. A relation may, however, be more complex than this: it may be quadratic, circular, etc. Models that capture such relations are more flexible, as they can take different forms and fit the data points more closely. Generally, such methods result in higher accuracy. But this flexibility comes at the cost of interpretability, as a complex relation is harder to interpret.
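For instance, fitting such a restrictive linear model with scikit-learn might look like the rough sketch below. The dataset and the coefficients are invented purely for illustration; the point is that each fitted coefficient reads off directly as the effect of one feature on the response.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Hypothetical dataset with p = 3 features and an assumed linear relation
X = rng.normal(size=(500, 3))
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.2, 500)  # third feature has no effect

model = LinearRegression().fit(X, Y)
print(model.intercept_, model.coef_)
# Roughly 1.0 and [2.0, -0.5, 0.0]: each coefficient says how much the response moves
# when that one feature increases by one unit -- easy to interpret, but only linear.
```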
Choosing a flexible model does not always guarantee high accuracy. This happens because a flexible statistical learning procedure works very hard to find patterns in the training data, and may pick up patterns that are caused by random chance rather than by true properties of the unknown function f. This changes our estimate of f(X), leading to a less accurate model. This phenomenon is also known as overfitting.
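Here is a small sketch of what that looks like, again on the toy sine data from earlier: a very flexible fit (a high-degree polynomial, chosen arbitrarily) drives the training error toward zero while the test error stays much larger.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)

X_train = rng.uniform(0, 1, (20, 1))
y_train = f(X_train).ravel() + rng.normal(0, 0.3, 20)
X_test = rng.uniform(0, 1, (1000, 1))
y_test = f(X_test).ravel() + rng.normal(0, 0.3, 1000)

# A very flexible fit: a degree-15 polynomial chasing the noise in only 20 training points
flexible = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
flexible.fit(X_train, y_train)

print(mean_squared_error(y_train, flexible.predict(X_train)))  # very low training error
print(mean_squared_error(y_test, flexible.predict(X_test)))    # much higher test error: overfitting
```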
When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. This is when we use more flexible methods.
Quality of fit
To quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation, the most commonly used measure in the regression setting is the mean squared error (MSE), MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))², where f̂(xᵢ) is the prediction our estimate gives for the i-th observation.
As the name suggests, it is the mean of the squared errors, i.e. the differences between the predictions and the observed values, over all inputs. It is known as the training MSE if calculated on the training data, and the test MSE if calculated on test data.
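In code the measure is a one-liner; the y_train, y_test, and model names below are placeholders rather than anything defined earlier in the article.

```python
import numpy as np

def mse(y_observed, y_predicted):
    """Mean squared error: the average of the squared prediction errors."""
    return np.mean((np.asarray(y_observed) - np.asarray(y_predicted)) ** 2)

# training MSE: mse(y_train, model.predict(X_train))
# test MSE:     mse(y_test,  model.predict(X_test))
```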
The expected test MSE, for a given value x0, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x0), the squared bias of f̂(x0), and the variance of the error term e, where f̂ denotes our estimate of f. That is, expected test MSE at x0 = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(e). Here e is the irreducible error, which we discussed earlier. So, let’s see more about bias and variance.
Bias
Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. So, if the true relation is complex and you try to use linear regression, it will undoubtedly result in some bias in the estimate of f(X). No matter how many observations you have, it is impossible to produce accurate predictions with a restrictive/simple algorithm when the true relation is highly complex.
Variance
Variance refers to the amount by which our estimate of f(X) would change if we estimated it using a different training data set. Since the training data is used to fit the statistical learning method, different training data sets will result in different estimates. Ideally, the estimate of f(X) should not vary too much between training sets. If a method has high variance, then small changes in the training data can result in large changes in the estimate of f(X).
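Both quantities can be estimated empirically by refitting the same method on many simulated training sets. The sketch below does this at a single point x0, under the same toy assumptions as before (a sine f, Gaussian noise, and an arbitrarily chosen degree-3 polynomial fit); the estimated variance and squared bias, plus the noise variance, add up to roughly the expected test MSE at x0 from the decomposition above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true relation
noise_sd = 0.3
x0 = np.array([[0.25]])               # the point at which we decompose the error

preds = []
for _ in range(500):                  # many independent training sets
    X = rng.uniform(0, 1, (30, 1))
    y = f(X).ravel() + rng.normal(0, noise_sd, 30)
    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
variance = preds.var()                        # how much the estimate moves between training sets
bias_sq = (preds.mean() - f(x0)[0, 0]) ** 2   # how far the average estimate sits from the true f(x0)
irreducible = noise_sd ** 2                   # variance of the error term e

print(variance, bias_sq, variance + bias_sq + irreducible)  # last value ~ expected test MSE at x0
```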
General Rule
A statistical method that tries to match the data points too closely will fit each particular dataset very accurately, but any change in the dataset will then produce a noticeably different estimate. The general rule is that as a statistical method matches the data points more closely, or as a more flexible method is used, the bias reduces but the variance increases.
In the above image, the left-hand side shows a graph of three different statistical methods in a regression setting. The yellow curve is linear, the blue one is slightly non-linear, and the green one is highly non-linear/flexible, as it matches the data points very closely. The right-hand side shows a graph of MSE versus the flexibility of these three methods, where the red curve represents the test MSE and the grey curve the training MSE. It is not certain that the method with the lowest training MSE will also have the lowest test MSE. This is because some methods estimate their coefficients specifically to minimize the training MSE, yet they might not have a low test MSE. This problem can be chalked up to overfitting. As seen in the graph, the green curve (the most flexible) has the lowest training MSE but not the lowest test MSE. Let’s go a little deeper into this problem.
This is a graph showing the test MSE (red curve), bias (green curve), and variance (yellow curve) with respect to the flexibility of the chosen method, for a particular dataset. The point of lowest test MSE reveals something interesting about the two error components: as flexibility increases, the bias initially decreases faster than the variance increases. After some point the bias stops decreasing, but the variance starts increasing rapidly due to overfitting.
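Extending the earlier simulation across several levels of flexibility (arbitrary polynomial degrees on the same toy sine data) shows this pattern numerically: the squared bias shrinks as the degree grows, while the variance climbs.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)
noise_sd = 0.3
x0 = np.array([[0.25]])

for degree in (1, 3, 9, 15):          # increasing flexibility
    preds = []
    for _ in range(300):              # refit on fresh training sets
        X = rng.uniform(0, 1, (30, 1))
        y = f(X).ravel() + rng.normal(0, noise_sd, 30)
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression()).fit(X, y)
        preds.append(model.predict(x0)[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0)[0, 0]) ** 2
    print(degree, bias_sq, preds.var())   # bias² tends to fall, variance tends to rise
```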
Bias-Variance Trade-off
In the above figure, imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bull’s-eye, our predictions get worse and worse. Imagine we can repeat our entire model-building process to get a number of separate hits on the target, such that each blue dot represents a different realization of our model based on a different dataset for the same problem. The figure displays four cases, representing the combinations of high and low bias and variance: high bias is when all the dots are far from the bull’s-eye, and high variance is when the dots are widely scattered. This illustration, combined with the previous explanation, makes the difference between bias and variance pretty clear.
As described earlier, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. There is always a trade-off between these values because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
Mastering the trade-off between bias and variance is necessary to become a machine learning champion.
This concept should be kept in mind while solving machine learning problems, as it helps in improving model accuracy. Retaining this knowledge also helps you quickly decide which statistical model works best in different situations.
From: https://towardsdatascience.com/balancing-bias-and-variance-to-control-errors-in-machine-learning-16ced95724db