Reading the Confusion Matrix
Classification models should be selected keeping in mind that not all errors are created equal, and the F-measure helps with that.
Classification is a supervised machine learning method used to predict the likely category a new data point belongs to. Most classification problems are binary problems where a yes/no type of result is being sought. Examples of binary classification problems in engineering include predicting whether there is a defect in a part based on material density, or predicting whether maintenance is needed based on sensor data. Multi-class problems are grouping problems, such as predicting which part class a new part belongs to: tire, wheel, window, door, or vehicle body.
Classification model accuracy is judged by the confusion matrix and several metrics derived from it. Confusion matrices tell us not only about the expected error but also where the error occurs. In most classification models, particularly binary ones, we need to keep in mind that not all errors are created equal. This could be due to class imbalance, or because the consequences of a prediction error in one class are more significant than in others. That can make picking the best model difficult without an additional measure that makes models with different error profiles comparable. This measure is called the F-measure.
|                    | Actual Positive | Actual Negative |
|--------------------|-----------------|-----------------|
| Predicted Positive | True Positive   | False Positive  |
| Predicted Negative | False Negative  | True Negative   |

- Confusion Matrix for Binary Classification Problems
|                     | Actual Defect | Actual No Defect |
|---------------------|---------------|------------------|
| Predicted Defect    | 1             | 4                |
| Predicted No Defect | 3             | 40               |

- Example Confusion Matrix for Defective Part Detection
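The four cells of the example matrix can be reproduced by counting prediction outcomes directly. A minimal sketch, assuming hypothetical label lists where 1 means defect and 0 means no defect:

```python
# Build a binary confusion matrix by counting prediction outcomes.
# y_true / y_pred are hypothetical labels: 1 = defect, 0 = no defect.
y_true = [1] * 4 + [0] * 44              # 4 actual defects, 44 good parts
y_pred = [1, 0, 0, 0] + [1] * 4 + [0] * 40  # reproduces the example table

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)  # 1 4 3 40
```

The same counts can be obtained from library helpers (e.g. scikit-learn's `confusion_matrix`), but counting them by hand makes the cell definitions explicit.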
As the first table shows, an ideal confusion matrix has zero off-diagonal terms: no false positives and no false negatives. In practical applications this is unlikely to happen due to complexities in the application's behavior. The instinct then may be to pick the classification model with the minimum sum of false positives and false negatives, but in most cases this is not optimal. In the part defect detection example, erring on the side of predicting a defect when there is none (a false positive) is a safer error than predicting no defect when there is one (a false negative). The first leads to unnecessary physical part testing, which increases testing cost, whereas the latter can lead to failures with far more serious consequences. Hence false negatives should be penalized heavily while training classification models, so that we minimize not only the total error (the sum of false positives and false negatives) but also the contribution of false negatives to it.

The challenge is how to pick between two models whose total error and false negative counts point in opposite directions, such as one with low total error but many false negatives versus another with high total error but few false negatives. This is where the F-measure helps: it produces a single score that reflects both kinds of error, which makes model evaluation and selection easier. To understand how the F-measure is calculated, we first need to understand the recall and precision scores.
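The dilemma can be made concrete with two hypothetical models evaluated on the same test set (the confusion-matrix counts below are invented for illustration):

```python
# Two hypothetical models on the same test set. Total error alone
# cannot settle the choice: the model with fewer total errors
# misses more real defects.
model_a = {"tp": 3, "fp": 9, "fn": 1, "tn": 35}  # more total errors, few misses
model_b = {"tp": 1, "fp": 4, "fn": 3, "tn": 40}  # fewer total errors, more misses

for name, m in [("A", model_a), ("B", model_b)]:
    total_error = m["fp"] + m["fn"]
    print(name, "total error:", total_error, "false negatives:", m["fn"])
```

Model B wins on total error (7 vs. 10) yet lets three times as many defective parts through, which is exactly the trade-off a single total-error number hides.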
Recall measures, of all the actual positives, how many were correctly predicted. It is calculated as true positives / (true positives + false negatives). In the part defect detection example this is 1/(1+3) = 0.25. The higher the recall, the better.
Precision measures, of all the positive predictions, how many are correct. It is calculated as true positives / (true positives + false positives). In the same example this is 1/(1+4) = 0.20. The higher the precision, the better.
The F-measure is a composite measure that lets us compare a high-precision/low-recall model against a low-precision/high-recall one. It is calculated as 2 × (Recall × Precision) / (Recall + Precision). In the part defect detection example this is 2 × (0.25 × 0.20) / (0.25 + 0.20) ≈ 0.22.
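The three formulas above can be checked directly against the defect-detection counts:

```python
# Recall, precision, and F-measure from the defect-detection example
# (tp = 1, fp = 4, fn = 3, taken from the example confusion matrix).
tp, fp, fn = 1, 4, 3

recall = tp / (tp + fn)                  # 1 / 4 = 0.25
precision = tp / (tp + fp)               # 1 / 5 = 0.20
f_measure = 2 * recall * precision / (recall + precision)

print(round(recall, 2), round(precision, 2), round(f_measure, 2))  # 0.25 0.2 0.22
```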
With the F-measure, selecting a classification model reduces to picking the one with the highest F-measure value, leaving no room for confusion in reading the confusion matrix.
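This selection rule can be sketched in a few lines. The candidate counts are hypothetical confusion-matrix entries, not from any real model:

```python
# Selecting between candidate models by F-measure.
def f_measure(tp, fp, fn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

# Hypothetical (tp, fp, fn) counts for two candidate models.
candidates = {
    "model_a": (3, 9, 1),   # many false positives, few missed defects
    "model_b": (1, 4, 3),   # fewer false positives, more missed defects
}

best = max(candidates, key=lambda name: f_measure(*candidates[name]))
print(best)  # model_a
```

Here model_a scores F ≈ 0.375 against model_b's F ≈ 0.22, so the rule favors the model that misses fewer real defects even though its total error is higher.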