Determining the Root of R-squared
The coefficient of determination is widely used but perhaps less well understood. Two simple graphical illustrations can help interpret its meaning.
R² is the comfort food of regression analysis. I feel we all keep coming back to it, yet prefer not to think too hard about where it comes from or whether it is any good for us. After building a model, it is natural to want an ex post quantification of quality, i.e. is the model predicting accurately, or do we need to try again? R² is a workhorse measure of goodness of fit, but just like the Latin phrases from the last sentence, knowing what it means is vital to using it effectively.
One useful visual interpretation is most closely associated with an ordinary least squares regression model. Given a set of points, the regression identifies the set of coefficients that provide a best-fit function to the data. At each data point it is possible to draw two boxes as shown in the image below.
The orange box has side lengths equal to the model’s predictive error at that point. The green box has side lengths equal to the difference between the data point and the statistical mean of the data set. When the orange box is small compared to the green box, it indicates the model predicts well compared to using only the mean as a prediction. By taking the cumulative sum of the orange and green box areas and then calculating their ratio, it is possible to write the formula for R² as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
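As a concrete sketch of that calculation, here is the same idea in NumPy on hypothetical synthetic data (not a dataset from this article): the residual sum of squares plays the role of the summed orange box areas, and the total sum of squares plays the role of the summed green box areas.

```python
import numpy as np

# Hypothetical synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Ordinary least squares fit and its predictions.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Summed "orange box" areas: squared prediction errors at each point.
ss_res = np.sum((y - y_hat) ** 2)
# Summed "green box" areas: squared deviations from the mean.
ss_tot = np.sum((y - np.mean(y)) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```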
For accurate models, the orange boxes become small and the metric approaches an upper limit of 1.0. Inaccurate models have larger orange boxes and R² will decrease. For ordinary least squares regression, the theoretically least accurate model is the mean itself, so in this case R² is bounded below by 0.0.
A second helpful visualization is derived from interpreting the coefficient of determination as the squared value of the correlation between the dependent variable and its predicted value. Because the correlation coefficient is typically represented with the symbolic variable r, this is the source of the name “R-squared”. The image below shows the scatter between the actual data and its predicted values. The right plot for F2 shows a tighter correlation (0.97) than the left plot for F1 (0.81).
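A short sketch of that equivalence, again on hypothetical data: for an ordinary least squares fit, squaring the Pearson correlation between the observed values and the model’s predictions reproduces the R² computed from the sums of squares.

```python
import numpy as np

# Hypothetical data and OLS fit (same setup as the earlier sketch).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Pearson correlation between actuals and predictions...
r = np.corrcoef(y, y_hat)[0, 1]

# ...squared, it matches R^2 from the box-area (sum of squares) formula.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print(f"r^2 = {r**2:.3f}, 1 - SS_res/SS_tot = {1 - ss_res / ss_tot:.3f}")
```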
For a perfect correlation of 1.0, the points would all align on a 45-degree line. From this visualization it is easier to conceive of a correlation so poor that it becomes negative. As mentioned above, negative values of R² are not possible with ordinary linear regression, but they are quite possible with other regression techniques, e.g. Gaussian process regression. The interpretation of the coefficient of determination as the square of the correlation has an obvious shortcoming once R² becomes negative: how does a squared value turn out negative? Regardless of this limitation, the interpretation of all negative R² values is the same: the predictive model is so poor that a simple mean would be more accurate!
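A minimal, contrived illustration of both points: predictions worse than the mean push R² below zero, even while the squared correlation sits at its maximum.

```python
import numpy as np

# Hypothetical toy example: a "model" that is worse than simply
# predicting the mean drives R^2 below zero.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_bad = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # perfectly anti-correlated

ss_res = np.sum((y - y_bad) ** 2)       # orange boxes sum to 40.0
ss_tot = np.sum((y - np.mean(y)) ** 2)  # green boxes sum to 10.0
print(f"R^2 = {1 - ss_res / ss_tot:.2f}")  # -3.00: far worse than the mean

# The squared correlation, by contrast, is a perfect 1.0 here,
# which is exactly the shortcoming described above.
r = np.corrcoef(y, y_bad)[0, 1]
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")  # r = -1.00, r^2 = 1.00
```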
I find that, like many other data science concepts, R² seems complex but becomes easier to understand with illustrative interpretations. Marking up simple plots and charts can aid comprehension. Please share any similar experiences where “a picture is worth 1000 words”, or, maybe more honestly, an area where you struggle and wish you had that picture.