Thoughts on Error Metrics in Regression

Joseph Pajot
Altair Employee

The choice of error metric has a significant impact on how a predictive model is trained.  This simple example helps explain why.

If asked, I think most people could draw a “best fit” line through some data points.  When asked why they drew that specific line, many would respond with some variation on the idea of “minimal error” between the line and the points.  A smaller subset still might be specific enough to talk about minimizing the “squared error” between the line and the points.  The distinction between these two answers is likely meaningless in casual conversation, but the difference can be profound in data science.

For discussion, let us begin with the data set shown here: the stress versus radius in a circular cross section under tensile load.

[Image: scatter plot of stress versus radius data]

Mathematically, the task of curve fitting is to find the parameters of an assumed “best fit” function.  For this example, we can search for the coefficients of a second-order polynomial:

f(x) = A + B*x + C*x^2
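In code, this assumed model can be written as a small helper function.  The sketch below is in Python, and the names A, B, and C simply mirror the coefficients in the formula above.

def f(x, A, B, C):
    # Second-order polynomial: f(x) = A + B*x + C*x^2
    return A + B * x + C * x**2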

This is where we could reach for our university textbooks and look up the linear algebra formulas that solve for the coefficients, but that familiar solution assumes the goal is to minimize the squared error.  Minimizing the squared error happens to produce a problem with an algebraic solution.  That convenient solution has remained popular for over 200 years since its discovery, partly due to its simplicity.  Selecting a metric other than squared error will not, in general, yield an algebraic solution.  In those cases, we can use an optimizer to determine the optimal coefficients.  Naturally, an optimizer can also be used for squared error, but it will only reproduce the algebraic solution, and less efficiently.
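As a rough sketch of the two approaches, the Python snippet below fits the polynomial both ways.  The (x, y) values are hypothetical placeholders rather than the actual stress and radius numbers behind the plots, and the optimizer setup (scipy's Nelder-Mead method) is just one reasonable choice for the non-smooth absolute-error objective.

import numpy as np
from scipy.optimize import minimize

# Hypothetical data standing in for the stress-versus-radius points.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([9.5, 4.6, 2.7, 1.8, 1.3, 1.0, 0.8])

def f(x, A, B, C):
    # Second-order polynomial: f(x) = A + B*x + C*x^2
    return A + B * x + C * x**2

# Squared error: the classic algebraic (least-squares) solution.
# np.polyfit returns coefficients highest power first, so reverse to (A, B, C).
A_ls, B_ls, C_ls = np.polyfit(x, y, 2)[::-1]

# Absolute error: no closed-form solution, so hand the loss to an optimizer.
def mean_abs_error(coeffs):
    return np.mean(np.abs(y - f(x, *coeffs)))

result = minimize(mean_abs_error, x0=[A_ls, B_ls, C_ls], method="Nelder-Mead")
A_la, B_la, C_la = result.x

print("least squares (A, B, C):", A_ls, B_ls, C_ls)
print("least absolute (A, B, C):", A_la, B_la, C_la)

Starting the optimizer from the least-squares coefficients is a convenience, not a requirement; any reasonable initial guess should work for a problem this small.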

The image below shows the best fit curve for the mean squared error in black and for the mean absolute error in red.

[Image: best fit curves for mean squared error (black) and mean absolute error (red)]

Although similar, the two curves are distinct solutions.  The near equivalence is due to the relatively low error in this data set.  To illustrate this further, the image below shows the same calculations after the leftmost data point has been perturbed.

[Image: best fit curves after perturbing the leftmost data point]

In this plot, using the mean squared error metric results in a significant change: the black curve reacts much more strongly to the alteration of the first point.  This makes logical sense, as the large error at the first point is squared, giving it disproportionate influence compared with the other points’ comparatively small error contributions.  Without that squaring effect, the red curve is far less affected.
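To make that concrete, the sketch below repeats the experiment with the same hypothetical data as before: the leftmost point is shifted upward and both fits are recomputed.  A residual of 3, for example, contributes 9 to the squared loss but only 3 to the absolute loss, which is why the least-squares coefficients move so much more.

import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([9.5, 4.6, 2.7, 1.8, 1.3, 1.0, 0.8])

def fit_both(x, y):
    # Least-squares fit (algebraic), reordered to (A, B, C).
    ls = np.polyfit(x, y, 2)[::-1]
    # Least-absolute fit via an optimizer, started from the least-squares result.
    mae = lambda c: np.mean(np.abs(y - (c[0] + c[1] * x + c[2] * x**2)))
    la = minimize(mae, x0=ls, method="Nelder-Mead").x
    return ls, la

y_perturbed = y.copy()
y_perturbed[0] += 4.0  # perturb the leftmost point

for label, data in [("original", y), ("perturbed", y_perturbed)]:
    ls, la = fit_both(x, data)
    print(label, "least squares:", np.round(ls, 3), "least absolute:", np.round(la, 3))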

Although simplistic, the examples shown here illustrate a general technique for finding the “best fit” to data.  In addition to handling general error metrics, optimizers also work well for underdetermined systems (fewer observations than unknown parameters), where the algebraic solution falters due to ill-conditioning.  Deep learning is a prime example of this exact idea in practice at a much larger scale.  These large neural networks require an optimizer to solve for the underdetermined values of many neuron weights, and the choice of error metric can impact the solution.  The scale of deep learning magnifies that inherent complexity.  Starting with basic examples, as shown here, can help make the incomprehensible understandable.