The Power of Visual Explanations with Linear Effects
Data science insights come from quantitative information. Presenting the numbers is just as important as calculating them.
I have long been fascinated by the idea of information design. I see people such as Edward Tufte and Nate Silver as leaders in providing elegant visual explanations of data. Only in the movie The Matrix can someone claim to make sense of a display of raw numbers; the rest of us rely on graphic representations of quantitative information. This discussion will focus on the predictive power of data models built from physics simulations, but the ideas are widely applicable. Specifically, this article will introduce the idea of an effect, how it can be calculated from data, how it is frequently visualized, and how it can be specially visualized within the domain of engineering simulations.
The measure of predictive power allows us to identify which inputs have the most influence on our model. In simple terms: how do the inputs affect the outputs? From the early works of Fisher and Peirce to later practitioners like Taguchi and Box, a core concept in experimental design has been the “effect” of an input variable. This effect is itself a measure of the predictive power of a variable. The “main effect” quantifies the overall impact of an input, averaged across the variation in the other inputs. Frequently this idea is applied in the context of factorial experiments, which leads to a tidy algebraic formula such as this for a two-level experiment:
([X2,Y2] + [X2,Y1] – [X1,Y2] – [X1,Y1])/2
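This main effect is just the average response at the high level of X minus the average response at the low level. A minimal sketch in Python, using made-up response values for a hypothetical 2×2 factorial experiment:

```python
# Hypothetical 2^2 factorial experiment: input X at levels X1/X2,
# input Y at levels Y1/Y2. Keys are (x_level, y_level); values are
# the measured response at that corner of the design.
runs = {
    ("X1", "Y1"): 10.0,
    ("X1", "Y2"): 12.0,
    ("X2", "Y1"): 15.0,
    ("X2", "Y2"): 17.0,
}

def main_effect_x(runs):
    """Average response at X2 minus average response at X1."""
    high = (runs[("X2", "Y1")] + runs[("X2", "Y2")]) / 2
    low = (runs[("X1", "Y1")] + runs[("X1", "Y2")]) / 2
    return high - low

print(main_effect_x(runs))  # 5.0
```

Averaging over the levels of Y before differencing is what makes this the *main* effect of X: the influence of the other input is deliberately washed out.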
For more general cases an algebraic expression is not readily available. It can be useful instead to recast the problem as a simple ordinary least squares regression of the data:
f(x) = A + Bx
where the slope of the line B = Δy/Δx.
The net resultant change, Δy, is the effect. This means we can represent the linear effect via the regression coefficient B:
Δy = BΔx
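A minimal sketch of this regression-based effect, using NumPy and made-up sample data for one input (the scatter stands in for the variation contributed by the other inputs):

```python
import numpy as np

# Hypothetical samples of one input x and the output y; the points
# scatter around a linear trend because other inputs vary as well.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit f(x) = A + B x by ordinary least squares.
# np.polyfit returns coefficients highest-degree first: [B, A].
B, A = np.polyfit(x, y, 1)

# The linear effect is the net change over the input's range.
effect = B * (x.max() - x.min())
print(f"slope B = {B:.2f}, effect = {effect:.2f}")
```

Repeating this one-variable fit for each input yields one signed effect per input, ready to be ranked.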
After regressing on each independent variable, the effects can be tabulated. A sortable table is useful for ranking the most influential inputs for a given output, as shown here in Altair Knowledge Studio for a different set of predictive power metrics.
Alternatively, this same data can be presented in a graphic known as a Pareto chart. This chart presents the inputs in descending order of impact, and it also makes clear whether each effect has a positive or negative correlation with the output. As a bit of trivia, this plot is related to the so-called 80/20 rule, also known as the Pareto principle: 80 percent of the effect is explained by only 20 percent of the variables. The Pareto chart here is from Altair HyperStudy.
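The ranking behind a Pareto chart can be sketched in a few lines. The input names and effect values below are hypothetical, standing in for the per-input effects produced by the regressions described above:

```python
# Hypothetical signed linear effects, one per input variable.
effects = {"thickness": 4.2, "radius": -6.5, "length": 1.1, "modulus": 0.3}

# Pareto ordering: sort by descending magnitude but keep the sign,
# so both the ranking and the direction of correlation are visible.
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, eff in ranked:
    sign = "+" if eff >= 0 else "-"
    print(f"{name:>10s}  {sign}  {abs(eff):.1f}")
```

A plotting library would draw these as descending bars, typically colored by sign; the essential information is simply the signed, magnitude-sorted list.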
This graphic is a richer visualization of the data, but what if we focus on engineering design problems? With this narrowed scope, an engineering data scientist can further present the data less abstractly. In engineering design, our inputs represent some feature of a physical object: a part’s thickness and material, or a shaft’s radius and length, for example. Showing the result as a contour on a 3D model can bring the same data to life. The image below from Altair HyperWorks shows the same data as seen above but visualized on the model with two colors to denote positive and negative correlations and saturation to represent magnitude.
This is just one example of how to present data to engineers using the principles of information design. Data visualization is a fascinating topic that can be as much an art as a science. I’d love to hear about some insightful visualizations you’ve seen in the comments below, even if they are not engineering related.