The "Karate Kid" Approach to Machine Learning
It pays to start simple.
Here’s a common scenario: you have a world-changing idea that uses machine learning to predict “X”, you gather all the data you can find, feed it all into the hottest new ML algorithm, test… and the predictive quality is absolute garbage. What then?
Machine learning models are seldom plug-and-play technology (especially for new applications), and the sheer quantity of algorithms, hyperparameters, data transformations and quality metrics at our disposal can be, quite frankly, overwhelming.
When I get to this place, I try to channel my inner Mr. Miyagi. Fans of the 1984 film “The Karate Kid” will recall how Mr. Miyagi, the mentor and trainer for a young and ambitious martial arts student named Daniel, insists that Daniel begin his training with mundane, fundamental tasks before he can even attempt the real moves. Whether it's karate or data science, the name of the game is to start simple.
Whether it's karate or data science, starting with the fundamentals is often the best approach. Left: scenes from “The Karate Kid”. Before attempting the “crane kick”, Daniel learned the fundamentals by waxing a car. Right: a very complex model (“DES-HyperNEAT” by Tenstad and Haddow, 2021) and a very basic model (linear regression) for the same task.
To that end, here are five tips for simplifying the training process the next time you get stuck:
1. Visualize everything
The ML training workflow is chock-full of abstractions. We often lump training data into giant tensors and forget the numbers inside are physically meaningful. During training we might track one or more quality metrics, which can be useful diagnostics, but they hardly tell the full story. I have found that visualizing your data at each step in the process is invaluable for finding bugs, diagnosing modeling errors, and understanding the fundamental task at hand. You can try plotting features (transformed and untransformed), labels, predictions, errors, and their distributions. Are there outliers? Are the distributions skewed? Are there any interesting correlations between features, labels, and errors?
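If it helps to make this concrete, here is a minimal sketch using pandas and matplotlib. The file name "data.csv" and the label column "y" are placeholders for your own dataset; everything else is stock library code.

```python
import matplotlib.pyplot as plt
import pandas as pd

# "data.csv" and the label column "y" are placeholders for your own data.
df = pd.read_csv("data.csv")

# Per-column histograms: skewed distributions and outliers jump out fast.
df.hist(bins=50, figsize=(12, 8))
plt.tight_layout()
plt.show()

# How strongly does each numeric feature correlate with the label?
print(df.corr(numeric_only=True)["y"].sort_values())

# Scatter each feature against the label to spot relationships by eye.
for col in df.columns.drop("y"):
    df.plot.scatter(x=col, y="y", alpha=0.3)
    plt.show()
```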
2. Try simplifying the data
One way to simplify the problem is to first attempt to learn on some logical subset of the data. For example, if you’re trying to predict the resonant frequency of an oil pan, why not start with pans that have the same mode shape? Can you make any logical partitions of the data, for example, samples that have the same load conditions or materials? By reducing the variance in the data you may make the prediction process easier. You might discover scenarios where your model shines and others where it struggles. Why might that be?
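As a sketch of what this might look like in code, here is one way to train and score on each partition separately. The dataset, the "mode_shape" column, and the "resonant_freq" label are hypothetical, echoing the oil pan example above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset from the oil pan example above.
df = pd.read_csv("oil_pan_data.csv")

# Fit and score a separate model on each logical partition of the data.
# If the model shines on some partitions and struggles on others, ask why.
for mode_shape, subset in df.groupby("mode_shape"):
    X = subset.drop(columns=["mode_shape", "resonant_freq"])
    y = subset["resonant_freq"]
    scores = cross_val_score(LinearRegression(), X, y, cv=5)
    print(f"mode shape {mode_shape}: mean R^2 = {scores.mean():.2f}")
```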
3. Try simplifying the model
Besides simplifying the data, it can also be useful to simplify the model. Sometimes the fanciest new predictive algorithms can be the hardest to get right. Instead of jumping right into deep learning, why not try a more traditional approach? For regression I like to start with least squares. For classification, you might consider a decision tree. In addition to being easy to use and fast to train, these models have the added benefit of being interpretable. Their predictive performance will serve as a baseline for fancier methods, and you might learn something about your problem in the process.
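For instance, here is roughly what a least-squares baseline looks like in scikit-learn, with synthetic data standing in for your own:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in your own features and labels.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares: fast to train and easy to interpret.
baseline = LinearRegression().fit(X_train, y_train)
print("baseline R^2:", baseline.score(X_test, y_test))

# The fitted coefficients tell you how each feature drives the prediction,
# which is exactly the kind of insight a black-box model hides.
print("coefficients:", baseline.coef_)
```

Whatever fancier model comes later, it has to beat this score to earn its complexity. (For classification, sklearn.tree.DecisionTreeClassifier is the analogous starting point.)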
4. Explore data transformations
Sometimes a simple transformation of the data can make the learning process work significantly better. This is somewhat algorithm-dependent. Neural networks, for example, prefer standardized data (zero mean and unit variance). If either your features or labels are skewed, you might consider taking a log transform. Dimensionality reduction techniques like principal component analysis (PCA) are another type of transformation that’s useful if you have correlated features. As you become more familiar with a particular algorithm, you will get a feel for when data transformations are required. Until then, why not try a few?
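Here is a sketch of how a few of these transformations can be chained together in scikit-learn. The synthetic data and the 95% variance cutoff for PCA are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, with the label reshaped to be right-skewed.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
y = np.exp((y - y.min()) / y.std())

# Standardize the features, decorrelate them with PCA, fit least squares,
# and log-transform the skewed label (inverting it at prediction time).
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95),  # keep components explaining 95% of variance
        LinearRegression(),
    ),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
print("R^2:", model.score(X, y))
```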
5. Explore hyperparameters
This tip is last for a reason: tuning hyperparameters is unlikely to make much of a difference if you’re using the wrong data, model, or transformations. That said, there are certainly times when adjusting an algorithm’s hyperparameters can meaningfully improve prediction performance. If and when you decide to explore hyperparameters, make sure you read about what each one does. You might be able to use your domain knowledge to set reasonable bounds for some of them, or you might determine that some hyperparameters are not worth exploring. I recommend starting with a manual approach before jumping to an automated method like grid search. Doing so will give you a feel for which parameters matter and what effects they appear to have. Again, the more you understand, the better.
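A manual sweep can be as simple as a for-loop. The model and the single hyperparameter below (a random forest's max_depth) are just an example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in your own.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Vary one hyperparameter at a time and watch how the score responds.
# Understanding the trend is the point, not squeezing out the last 0.01.
for max_depth in [2, 4, 8, 16, None]:
    model = RandomForestRegressor(max_depth=max_depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"max_depth={max_depth}: mean R^2 = {scores.mean():.2f}")
```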
Conclusion
Hopefully these tips help you to think about the fundamentals the next time you are stuck on an ML project. Can you think of a time when simplifying the training process proved successful? Do you have any practical tips for debugging the training process?