How Pareto Optimality Can Cut Through the Confusion (Matrix)
Weighting outcomes is a common practice in classification models. Multi-objective optimization may provide more insight in hyperparameter tuning.
I first learned about multi-objective optimization in a university classroom. Initially, I found it hard to accept the validity of a solution that didn’t conform to my view that an optimization should produce a single clear solution. Unlike the single objective case, a multi-objective optimization typically results in a family of trade-off solutions (read here for a more detailed description of Pareto optimality). As part of that collegiate lesson, we students were also shown the common workaround technique of weighting objectives to recast the optimization as a single objective problem. This same weighting technique is sometimes applied to classification problems in machine learning, and I recently investigated how multi-objective optimization could be used instead to gain more insight.
To begin, let’s frame the discussion around a problem with two objectives to be minimized, for example mass and cost. The weighting technique creates a hybrid objective function, a weighted sum of the two objectives with coefficients A and B, whose minimization yields a single solution.
In contrast, a true multi-objective formulation will result in the non-dominated Pareto front shown here.
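As a sketch of what such a hybrid objective typically looks like, the weighted sum can be written as below. The coefficients A and B are the weights discussed in this article; the design vector x and the function names m (mass) and c (cost) are notation assumed here for illustration.

```latex
f(x) = A \, m(x) + B \, c(x), \qquad A \ge 0,\; B \ge 0
```

Setting B = 0 reduces this to a pure mass minimization, and setting A = 0 reduces it to a pure cost minimization.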
The extreme points on the front are the solutions to the single-objective optimization of each respective objective. For example, point 1 is the solution to minimum mass (equivalent to B=0 in the weighted function) and point 2 is the solution to minimum cost (A=0). Conceptually, the locus of points between them represents some combination of the weights A and B. It is true that the endpoints of the Pareto front map nicely to specific weightings of the hybrid objective function, but it is not generally possible to map coefficient weights to specific solutions on the front. For example, a hybrid objective function with equal weights does not generally map to the midpoint solution between the endpoints. In practice, casting a multi-objective problem as a single weighted objective will result in some point along the front, but which point is not clear in advance. The weights are often chosen by educated guesswork, essentially comparing apples to oranges, resulting in a solution that is equal parts hand waving and scientific rigor. Retaining a true multi-objective formulation provides a more complete picture of what can be achieved.
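The point about equal weights can be illustrated with a small sketch. The five designs below are made-up numbers (mass in kg, cost in dollars), chosen only so that they form a trade-off front; the function name `weighted_pick` is likewise hypothetical.

```python
# Hypothetical non-dominated designs as (mass_kg, cost_dollars) pairs.
# Mass increases while cost decreases, so no design dominates another.
front = [(1.0, 900.0), (2.0, 500.0), (3.0, 300.0), (4.0, 200.0), (5.0, 150.0)]


def weighted_pick(designs, a, b):
    """Return the design that minimizes the hybrid objective a*mass + b*cost."""
    return min(designs, key=lambda d: a * d[0] + b * d[1])


mass_only = weighted_pick(front, 1.0, 0.0)  # B = 0: minimum-mass endpoint
cost_only = weighted_pick(front, 0.0, 1.0)  # A = 0: minimum-cost endpoint
equal = weighted_pick(front, 1.0, 1.0)      # equal weights

print("mass only:", mass_only)
print("cost only:", cost_only)
print("equal weights:", equal)
```

With these assumed numbers, equal weights select the minimum-cost endpoint rather than the middle design (3.0, 300.0), simply because cost is measured on a larger numeric scale than mass. Rescaling the units would move the selected point, which is the apples-to-oranges problem in miniature.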
Now let’s pivot back to classification problems in machine learning. The objective in these problems is to minimize the occurrence of false predictions; for binary classifiers, this means false positives (FP) and false negatives (FN). To this end, the training process frequently uses objective metrics such as accuracy or F1; both formulas are presented here for review.
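For reference, the standard textbook definitions of the two metrics in terms of confusion-matrix counts are given below. TP and TN (true positives and true negatives) are symbols added here; FP and FN are as defined above.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
```

Accuracy penalizes each FP and each FN equally relative to the total sample count, while F1 ignores TN entirely, so the two metrics combine the same error counts with different effective weights.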
From the formulas, it is clear that these are simply another case of hybrid objective functions. Each function weights FP and FN in its own subtly different way, which is completely analogous to the weights in the hybrid single-objective function from above. As an extension of this idea, it is possible to apply unequal weights to the objectives to bias the model to prefer one type of error over another. As an alternative to a weighting scheme that finds a single solution, a multi-objective scheme will provide the trade-off between objectives. The image below visualizes the optimal solution space for a classifier trained on the sixty thousand data points in the public Scania truck data set.
Each point represents a Keras model tuned using Altair HyperStudy by varying the model’s hyperparameters, such as the number of neurons in the hidden layers, the batch size, and the learning rate. The numbers of false negatives and false positives are plotted on the x- and y-axes, respectively, while the corresponding accuracy is indicated by the size and color of the markers. In this specific problem, false positives are highly undesirable. The highest-accuracy models occur when the number of false positives is near 100. Accuracy may not be the best metric for this problem, but accuracy is just like any other weighted metric. A weighted metric will find only one solution among the non-dominated set, whereas a multi-objective approach provides full insight into the complete range of optimal solutions.
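To make the last point concrete, here is a minimal sketch of extracting the non-dominated set from a collection of tuned models and comparing it with the single model that accuracy would select. The (FN, FP) counts are invented for illustration and are not the results from the study; only the sample count of sixty thousand comes from the text. The helper names are hypothetical, and this is a plain Python filter, not the HyperStudy workflow.

```python
TOTAL = 60_000  # data set size quoted in the text

# Hypothetical (false_negatives, false_positives) counts for tuned models.
models = [(40, 300), (60, 150), (90, 100), (150, 60),
          (300, 20), (120, 140), (200, 90)]


def non_dominated(points):
    """Keep points that no other point beats or ties on both objectives."""
    keep = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            keep.append(p)
    return keep


def accuracy(fn, fp, total=TOTAL):
    """Accuracy expressed through the error counts: 1 - (FN + FP) / total."""
    return 1.0 - (fn + fp) / total


front = non_dominated(models)
best = max(models, key=lambda m: accuracy(*m))

print("non-dominated models:", front)
print("accuracy-optimal model:", best)
```

With these made-up counts, five of the seven models are non-dominated, and accuracy singles out only one of them, (90, 100). The other four remain valid trade-offs that a single weighted metric would never surface, including the model with only 20 false positives.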