The First and Last Thing to Do with Your Machine Learning Data
Don’t spend all your data at once. Partitioning is key to all data projects.
Imagine starting a new data science project with the goal of building a predictive model from a freshly acquired and cleaned dataset. At first, it may seem tempting to throw all your data into a supervised model, train the model, and then eagerly inspect some quantitative metric to assess the model’s performance. But this would be a mistake. It is much better to first split, or partition, the data into training and testing sets. Although it is one of the first steps, partitioning affects model assessment and ultimately provides greater insight into the behavior of a predictive model.
There are multiple strategies for partitioning but, in general, any strategy is better than none. The simplest scheme is random assignment: some percentage of the data records are randomly assigned to the test set and the remainder to the training set, with a typical test size between 10% and 20% of the whole. For example, consider a database of image files consisting of bolts, nuts, and washers (among other parts).
| Category | Full Dataset | Train | Test |
|----------|--------------|-------|------|
| bolt     | 155          | 125   | 30   |
| nut      | 58           | 45    | 13   |
| other    | 109          | 88    | 21   |
| washer   | 41           | 32    | 9    |
The table above shows the size of each category in the full dataset as well as in the train and test partitions. Note that within each row, the train and test counts sum to the full dataset value.
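As a concrete illustration, the snippet below sketches the random-assignment scheme in Python using pandas and scikit-learn’s `train_test_split`. The DataFrame and its columns are hypothetical stand-ins for the image database, not the actual data behind the table above.

```python
# A minimal sketch of random assignment, assuming pandas and scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table of image records: one row per image, labeled by part type.
records = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(363)],
    "category": ["bolt"] * 155 + ["nut"] * 58 + ["other"] * 109 + ["washer"] * 41,
})

# Randomly assign ~20% of the records to the test set. Stratifying by category
# keeps the class proportions of the full dataset in both partitions.
train_df, test_df = train_test_split(
    records,
    test_size=0.20,
    stratify=records["category"],
    random_state=42,  # fixed seed so the split is reproducible
)

print(train_df["category"].value_counts())
print(test_df["category"].value_counts())
```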
As the name suggests, a machine learning model learns only from the data within the training partition. After training, it is possible to examine model quality separately for each partition. Consider the confusion matrices for the geometry classification problem introduced above.
The rows represent the actual labels, and the columns represent the predicted labels. This means the diagonal entries count the correct predictions, while entries off the diagonal are incorrect predictions. For example, in the test set, one of the washers is incorrectly identified as “other,” but the other eight washers in the set are correctly identified. As noted above, the sum of the train and test matrices equals the corresponding counts in the full matrix. The testing partition is most frequently associated with model performance, but the training metrics also provide insight.
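For reference, confusion matrices for both partitions could be produced along the lines of the sketch below, assuming scikit-learn. The random features and the `RandomForestClassifier` are placeholders, not the image model behind the matrices discussed here.

```python
# A sketch of computing train and test confusion matrices, assuming scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = ["bolt", "nut", "other", "washer"]

# Stand-in features and labels; in practice these would come from the images.
X = rng.normal(size=(363, 8))
y = np.repeat(labels, [155, 58, 109, 41])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are actual labels, columns are predicted labels, so the diagonal
# counts the correct predictions for each category.
cm_train = confusion_matrix(y_train, model.predict(X_train), labels=labels)
cm_test = confusion_matrix(y_test, model.predict(X_test), labels=labels)
print(cm_train)
print(cm_test)
```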
Training metrics represent how well the model predicts on data it has already seen. As evidenced in the data used here, a model can still predict incorrectly on data it has previously seen.
Testing metrics indicate how well the model performs on data it has not previously seen. This serves as an estimate of how well the model will perform in the future when it is applied to new inputs.
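Continuing the sketch above (and reusing its `model`, `X_train`/`X_test`, and `y_train`/`y_test`), one way to put the two views side by side is to compare a single metric, such as accuracy, on each partition.

```python
# A minimal sketch comparing training and testing accuracy, reusing the
# fitted model and partitions from the previous confusion-matrix sketch.
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large gap between the two suggests the model has memorized the training
# data rather than learned patterns that generalize to unseen inputs.
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```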
Together, these metrics give a rounded view of model behavior and remind us that predictions are inherently imperfect, even under ideal conditions.
Most machine learning software integrates partitioning either actively, where you configure the split yourself, or passively, where it happens automatically behind the scenes. The images below show partitioning workflows in Altair’s SmartWorks Analytics and HyperStudy.
Regardless of your environment, it is important to understand how partitioning functions in your workflows, for both setup and model evaluation.