What Exactly is Historical Data?
Historical data means different things for different applications. Before you start your journey of engineering data science with historical data, take time to define what your historical data contains.
When we think about machine learning, data science, or artificial intelligence, the next terms that come to mind are big data, data lakes, and historical data. Of those, historical data comes up often in the context of product design. It is also a term that is easy to use without a clear definition, and the lack of a clear definition leads to misaligned expectations. So today I want to list some of our findings about historical data, in the hope that they help you define your own use of historical data and set correctly aligned expectations.
The promise of machine learning is to learn from experience, hence the need for historical data. The idea is to learn from that collective information so that we stop repeating the same mistakes and move forward faster. But what exactly does your historical data contain, and what can it do for you? Below are some questions we have learned to ask when we talk about historical data for product design:
- Timeline: How historical does the data need to be? Are we talking about last week’s data? Is that enough? Are we thinking of last year’s data? Is that still relevant? This is a compromise you have to make. If your data source is simulations, be aware of changes in the solution schemes that may introduce artificial noise into your data. If you no longer use the materials you used 5 years ago, data from before that change may no longer be relevant.
- Variety: What data should be considered? If we work on automotive chassis design, will it benefit or hurt us to include data from other parts, industries, or physics? My answer is that until the predictive methods are mature enough, we do not benefit from that type of variety; stick to data from the same part, industry, and physics. The question does not end there. If we work on chassis, is chassis data for all vehicle types relevant? Or only the type we are focusing on? Do we care about different chassis designs, or should we focus on the data collected from the one we have? The answer depends on what you are changing and what you are trying to achieve. If you want to design your current application more quickly, limit your data to the part you are working on. If you want to train an ML model that helps you iterate on concepts faster, you do need to include data from multiple similar parts. Remember the famous George Box quote: “All models are wrong, but some are useful”. You are trying to train a model that makes good enough predictions to meet your objectives; you are not trying to train a model that makes some predictions for all objectives.
- Source: Is your data from simulations, tests, or operations? Do you want to learn from all of them at once? Different sources have different attributes and different noise factors, and there are not yet enough studies showing how data from different sources can be merged and made useful. So I would suggest holding on to this thought until the methods for working with engineering data mature.
- Origin: Once we decide on the timeline, variety, and source, the next thing to consider is the origin of the variation in the data. Does it come from trial-and-error design iterations? If so, the data is derived from a single design tweaked in multiple ways, possibly with only small changes. Does it come from a single design of experiments (DOE)? Data from DOEs will most likely cover the design space better, leading to more accurate predictive models. This can get more complicated if you have multiple configurations, each with its own set of design variations.
- Goals: What are your objectives? Are you looking to predict very similar designs (interpolation), the same topology with unseen dimensions (extrapolation), or to explore different topologies (unseen designs)? Predictive models are built for the first objective, and the second is a stretch for most of them. As for the last objective, predicting unseen topologies, I would recommend against it.
- Size: After you collect data with the relevant timeline, enough variety, and the correct source and origin, do you have enough to meet your objectives? If your objective is fast, real-time prediction, do you have good coverage of the design space? If your objective is to innovate, do you have enough range? If the answer is no, you can augment your datasets with simulations. If that is not an option, you can start with the data you have and improve your models continuously as you collect more. Make sure you are aware of the shortcomings of predictions made from small datasets.
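As a rough illustration of the interpolation versus extrapolation question raised under Goals and Size, the sketch below checks whether a new design falls inside the per-variable ranges seen in the historical data. The variable names (`thickness_mm`, `width_mm`), the records, and the helper functions are hypothetical and exist only for this example. Note that a range check is a simplification: a design can sit inside every range and still be far from any recorded sample, so passing this check does not guarantee good coverage.

```python
from typing import Dict, List, Tuple

Design = Dict[str, float]


def variable_ranges(history: List[Design]) -> Dict[str, Tuple[float, float]]:
    """Return the (min, max) observed for each design variable in the history."""
    names = history[0].keys()
    return {n: (min(d[n] for d in history), max(d[n] for d in history)) for n in names}


def out_of_range_variables(candidate: Design, history: List[Design]) -> List[str]:
    """List the variables for which the candidate lies outside the historical range."""
    ranges = variable_ranges(history)
    return [n for n, (lo, hi) in ranges.items() if not lo <= candidate[n] <= hi]


# Hypothetical historical records; names and values are illustrative only.
history = [
    {"thickness_mm": 2.0, "width_mm": 40.0},
    {"thickness_mm": 2.5, "width_mm": 45.0},
    {"thickness_mm": 3.0, "width_mm": 50.0},
]

inside = {"thickness_mm": 2.2, "width_mm": 48.0}   # within every observed range
outside = {"thickness_mm": 3.5, "width_mm": 48.0}  # thickness exceeds the observed max

print(out_of_range_variables(inside, history))   # []
print(out_of_range_variables(outside, history))  # ['thickness_mm']
```

An empty list suggests the prediction is closer to interpolation; any listed variable signals extrapolation along that dimension, where predictions deserve extra caution.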
Some Historical Data Types
As we continue to work in engineering data science, we will continue to learn to differentiate even more aspects of what exactly historical data means. Now I would like to ask you the same question. What exactly is your historical data?
- Design iterations of the same design
- Design iterations for an application group
- DOEs
- No collected data yet
- Test data
- Other