How p-values Can Help You Know if You Have Enough Data
People often say "you don't know what you don't know" to characterize working with limited information. But even when data is limited, there is a way to know whether you've learned anything useful.
Simulating physics on a computer is hard! Setting up models is time-consuming for the analyst, and solving the equations takes a significant amount of computing time. The world of engineering design moves fast, and it is expensive for organizations to run and collect large amounts of simulation data. This means that, as engineering data scientists, we frequently need to draw inferences from very few data points. One specific challenge is understanding whether the predictive models we generate are explaining real phenomena or just learning to reproduce noise from our limited samples.
P-values are a super cool statistical measure that can change your life once you know how to interpret them. Let’s take a simple example to explain the concept. Imagine flipping a coin once a day and recording the results:
Day 1: Heads
Day 2: Tails
Day 3: Heads
Day 4: Tails
After four days, someone hypothesizes that there is a relationship: on odd-numbered days the coin will land heads, and on even-numbered days it will land tails. As smart people we may know this to be false, but the data backs up the hypothesis! What would be useful is some kind of number that assesses the probability that this data occurred by chance, that the pattern is not cause and effect but the product of a truly random process. This is the purpose of the p-value! It tells you the statistical likelihood of coming to a false conclusion from your model.
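To make that concrete, here is a minimal Python sketch (my own illustration, not from any particular tool) of the coin example. Under the null hypothesis of a fair coin, the chance of reproducing one specific heads/tails pattern purely by luck is 0.5 raised to the number of flips:

```python
# Probability that a fair coin reproduces one specific heads/tails
# pattern over n_days independent flips, purely by chance.
def pattern_p_value(n_days: int) -> float:
    return 0.5 ** n_days

for n_days in (4, 10, 20):
    print(f"{n_days:2d} days: p = {pattern_p_value(n_days):.6f}")

# 4 days:  p = 0.0625   -- plausible luck
# 20 days: p ~ 0.000001 -- luck is no longer a believable explanation
```

After four days the "odd days are heads" hypothesis could easily be luck (about a 6% chance), but each additional day the pattern holds cuts that probability in half.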
Imagine continuing the above coin example beyond 4 days. One possibility (and the most realistic one in this case) is that the data pattern will change. But if the same pattern were to continue, it becomes increasingly unlikely that the conclusion is wrong. The same principle applies when post-processing a least squares regression fit: more data would either overturn an incorrect conclusion or drive down the p-value, meaning it is improbable that the relationships in the data are due to random events. It is typical in statistics to consider only results with p < 0.05 as significant; this means we are accepting a 5% chance of a false positive. (For the truly statistically minded, the p-value in regression tests the null hypothesis that there is no relationship at all between the model's inputs and outputs, but let's keep this discussion simple.)
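Here is a minimal sketch of how p-values fall out of a least squares regression, assuming the statsmodels package is available. The data is made up for illustration: the response y depends on x1 but not on x2, mimicking a regression with one real effect and one noise variable:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=0)
n = 20                                   # deliberately small sample
x1 = rng.uniform(0.0, 1.0, n)            # real driver of the response
x2 = rng.uniform(0.0, 1.0, n)            # irrelevant variable
y = 3.0 * x1 + rng.normal(0.0, 0.5, n)   # y depends on x1 plus noise

X = sm.add_constant(np.column_stack([x1, x2]))  # intercept + 2 inputs
fit = sm.OLS(y, X).fit()

# One p-value per coefficient: expect a tiny value for x1 and a
# large one for x2, whose apparent contribution is just noise.
for name, p in zip(["const", "x1", "x2"], fit.pvalues):
    print(f"{name}: p = {p:.4f}")
```

With only 20 points, x1 should still show a very small p-value because its effect is strong, while x2's p-value should be large, flagging that any apparent relationship is likely a fluke of the sample.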
The image below shows the p-values from an ANOVA analysis in HyperStudy. Most of the variables in the regression have very low p-values, indicating it is doubtful that there is no relationship between those variables and the predicted quantity. The relatively high p-value (~0.1) of the variable m_1_varname_1 indicates that its contribution to the model is likely not due to anything real.
It is important not to confuse p-values with guarantees. They are only statistical likelihoods. It is possible to have a small p-value and a poor model; it is just unlikely. When you have many data points, it is intuitive to have confidence in your model's predictive quality. In contrast, it is precisely with smaller data sets that the p-value can best help you judge whether the data is sufficient to support the conclusion. I've seen projects where it is possible to learn from only a handful of very time-consuming simulations. I'd like to hear about the smallest (or largest!) simulation data sets you have used successfully.