How to Predict Accurately with Little Data
Simulation and geometry data is often hard to find in engineering data science. Cross validation helps when data is scarce.
Everyone wants to build a high quality machine learning model, and to many people this means a model with low errors. Another fundamental goal when building predictive models is to avoid overfitting the data. Good models strike a balance between high accuracy on the training data and generalization to new data. I’ve been involved with several projects that had to train models with only a handful of time-intensive explicit finite element simulations. Best practice requires splitting the data into train, test, and sometimes validation sets, but this partitioning can reduce the amount of data for training by 20-40%. With the comparatively small data sets encountered in the world of computational simulation, many engineering data science applications are already quite data limited before any partitioning takes place. In these cases, cross validation helps build high quality models with the little data that is available.
Predictive machine learning models learn from training data, while the testing data is withheld during the training phase and used afterward to assess the model’s predictive power. Consider regressing a line through two training points, as illustrated in the left image below.
The blue curve represents the model predictions for all values of the input variable x. Because the curve goes through the red training points, there is zero training error. The right image shows the same curve with a green testing point. Because the predicted curve doesn’t pass through this new datum, there is some testing error when the model is used at a previously unseen point. These testing errors are important for quantifying model performance.
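To make the distinction between training error and testing error concrete, here is a minimal sketch in plain Python. The data values are made up for illustration, and the helper names fit_line and mean_squared_error are hypothetical, not taken from any particular library. Mean squared error is assumed as the error measure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observations."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: two training points and one withheld testing point.
train_x, train_y = [1.0, 3.0], [2.0, 4.0]
test_x, test_y = [2.0], [4.0]

model = fit_line(train_x, train_y)
train_error = mean_squared_error(model, train_x, train_y)
test_error = mean_squared_error(model, test_x, test_y)

print("training error:", train_error)
print("testing error:", test_error)
```

A line through two points always fits them exactly, so the training error says nothing about how the model behaves elsewhere. Only the withheld point reveals that.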
The preceding discussion described train and test sets within the scope of training a single machine learning model. However, an additional data set partition is required when comparing multiple predictive models against each other. This validation set serves a similar conceptual purpose to a testing set. Each candidate machine learning model is trained on the training data, then the validation set is used to quantify each model’s performance, and the best model is selected based on its performance on the validation set. As attractive as dedicated validation sets appear, creating one further reduces the amount of available training data. This is where the technique known as k-fold cross validation can smartly reuse the training data to perform a validation role.
The first step in k-fold cross validation is to segment the training data into k equally sized groups, or folds, where each data point belongs to one, and only one, fold. Every candidate model is built k times, each time withholding one of the folds from training and then validating the predictive accuracy at the withheld points. The errors from each fold are aggregated to form an overall performance metric for the given candidate model. The concept is illustrated in the four images below for a linear regression through six data points.
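The fold bookkeeping described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not a production implementation: the six data points are made up, the names fit_line, k_fold_indices, and cross_validation_error are hypothetical helpers, folds are assigned round-robin without shuffling, and the per-fold errors are aggregated as a mean squared error, which is one common choice among several.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_indices(n_points, k):
    """Assign every index to exactly one of k folds (round-robin)."""
    return [list(range(fold, n_points, k)) for fold in range(k)]


def cross_validation_error(xs, ys, k):
    """Mean squared error over all points, each predicted while withheld."""
    squared_errors = []
    for holdout in k_fold_indices(len(xs), k):
        held = set(holdout)
        # Train only on the points outside the current fold.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        slope, intercept = fit_line(train_x, train_y)
        # Validate at the withheld points.
        for i in holdout:
            squared_errors.append((slope * xs[i] + intercept - ys[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Made-up data: six points, three folds of two points each.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.8, 5.15, 6.9, 9.2, 10.85]

folds = k_fold_indices(len(xs), 3)
cv_error = cross_validation_error(xs, ys, 3)
print("folds:", folds)
print("cross validation error:", cv_error)
```

In practice the data is usually shuffled before the folds are assigned so that any ordering in the data set does not leak into the folds. The round-robin assignment here just keeps the sketch deterministic.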
The upper left graph shows the six training points. The remaining three plots depict the process at each of the folds. The fold’s training data is in red, the holdout set is in green, and the associated predictive model is the blue curve. This imagery illustrates the process for a linear regression model, but the same process could be repeated for other predictive models, such as a higher order regression or a neural network. The model that produces the minimum aggregated predictive error on the green holdout points offers the best balance between high accuracy and avoiding overfitting.
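Selecting among candidate models by their aggregated holdout error can be sketched as follows. The assumptions are worth stating plainly: the data is made up, and the three candidates (a constant model, a straight line, and a nearest-neighbor lookup) are simple stand-ins chosen so the example runs with only the Python standard library. They are not the higher order regression or neural network mentioned above. All function names are hypothetical.

```python
def fit_constant(xs, ys):
    """Candidate 1: always predict the mean of the training outputs."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbor(xs, ys):
    """Candidate 3: return the output of the closest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_validation_error(fit, xs, ys, k):
    """Mean squared holdout error over k round-robin folds."""
    errors = []
    for fold in range(k):
        holdout = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in holdout]
        train_y = [y for i, y in enumerate(ys) if i not in holdout]
        predict = fit(train_x, train_y)
        errors.extend((predict(xs[i]) - ys[i]) ** 2 for i in sorted(holdout))
    return sum(errors) / len(errors)


# Made-up, roughly linear data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.8, 5.15, 6.9, 9.2, 10.85]

candidates = {
    "constant": fit_constant,
    "linear": fit_linear,
    "nearest neighbor": fit_nearest_neighbor,
}
scores = {name: cross_validation_error(fit, xs, ys, 3)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print("selected model:", best)
```

On this made-up data the nearest-neighbor candidate reproduces every training point exactly, yet it scores worse than the straight line on the withheld points. That is the overfitting signal that training error alone would hide.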
When data is limited, cross validation repurposes training data to build high quality and generalizable predictive models without diluting the pool of precious training data. When integrated into machine learning processes with automated method selection and parameter tuning, we, at Altair, have found cross validation to be an efficient and reliable technique. Let us know in the comments how you have found success when data is at a premium.