The Important Steps in a Data Science Project
Every successful project starts with a plan. The key steps in a data science project have a lot in common with the concepts seen in simulation tools.
To me, life is full of seemingly minor events that, in retrospect, end up having an outsized impact on how we think. This post is the story of how one of those events influenced me. I recall many details from the first lecture of my Introduction to Finite Element Analysis class, but especially the professor’s digression on the three main tasks: preprocessing, solving, and post-processing. Each step was distinct yet interwoven with the others in the overarching analysis, and, most importantly, each step introduced its own inaccuracies. An effective finite element analyst must be proficient in all three facets. Over time, I began to see this three-step flow as a general pattern in all my activities: something is prepared, something is made, and something is presented. Data science projects are no different, and within data science the steps have the established names of preparation, analysis, and visualization. Just as with finite elements, a successful data scientist needs to master all three areas.
Although the lines between the steps are often blurry, simulation software offerings typically provide modular solutions to the three-step process. Predictably, the same modularity appears in data science software. The remainder of this article walks through the parallels between these two worlds using Altair’s Simulation and Data Analytics solutions.
The first step is preprocessing. For simulations like finite elements, this most frequently means importing the raw geometry data and modifying it for the number crunching of the next phase. Time is spent fixing bad geometry, simplifying geometric features that don’t affect the solution, and organizing the model in general. These tasks are labor intensive, so a high value is placed on automation to save time and effort. With data science, this first step often involves cleaning bad or missing data, dropping data channels that don’t contribute further insight, and data engineering to organize disparate sources. The image below shows examples of these cleanup steps in both Altair’s HyperMesh and Monarch.
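To make the data side of that analogy concrete, here is a minimal sketch of such a cleaning pass in Python with pandas; the file names and column names are hypothetical stand-ins, not anything from a real Monarch workflow.

```python
import pandas as pd

# Hypothetical raw sensor export; file and column names are illustrative only.
df = pd.read_csv("raw_sensor_export.csv")

# Drop channels that carry no insight for the downstream analysis.
df = df.drop(columns=["device_serial", "firmware_version"])

# Fix bad data: coerce unparseable temperature readings to NaN, fill
# short gaps by interpolation, then drop rows that remain incomplete.
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
df["temperature"] = df["temperature"].interpolate(limit=3)
df = df.dropna()

# General organization: merge a second, disparate source on a shared key.
meta = pd.read_csv("asset_metadata.csv")
df = df.merge(meta, on="asset_id", how="left")
```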
The next step is solving, that is, performing the main numerical analysis. For engineering analysts, this means running a physics simulation code such as Radioss for structural dynamics or FEKO for electromagnetics. Each solver comes with a collection of settings and options to tune the solution’s accuracy and efficiency. The corresponding heavy computational analysis in data science often comes in the form of training machine learning models on the prepared data. While these powerful algorithms learn from data without explicit programming, they come with their own settings and options that can dramatically affect the results. The use of mass scaling in explicit finite elements and the learning rate of a neural network may have more in common than you previously imagined. Knowledge Studio enables users to quickly, repeatably, and reliably set up and train machine learning models with the easy-to-use workflow diagram interface seen below.
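As a rough illustration of those knobs, the sketch below trains a small neural network with scikit-learn rather than Knowledge Studio itself; the synthetic data and parameter values are assumptions chosen only to show where a setting like the learning rate enters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the prepared dataset from step one.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "solver settings": like mass scaling in an explicit FE run, the
# learning rate trades stability against speed of convergence.
model = MLPClassifier(
    hidden_layer_sizes=(32, 16),
    learning_rate_init=1e-3,  # large changes here can derail training
    max_iter=500,
    random_state=0,
)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```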
Finally, we get to post-processing. This is ultimately why we set up and analyzed our problem: we want to see results. More specifically, we need specialized tools to visualize the results in a way that gives us insight. The raw outputs of a structural analysis are rarely useful when tabulated into rows and columns of displacements and stresses at each node or element. Instead, it is much more valuable to see the analysis come to life by animating it on a contoured 3D geometric representation. Similarly, data analytics results are presented with plots and charts rather than raw numbers such as the training weights of a neural network or the polynomial coefficients of a regression model. The image below compares the post-processing offerings HyperView and Panopticon for finite elements and data analytics, respectively.
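Since HyperView and Panopticon are interactive products rather than code libraries, the sketch below uses generic matplotlib to make the same point: a predicted-versus-actual scatter plot conveys at a glance what a raw table of numbers cannot. The data here is synthetic and purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative stand-in for model output: predictions and actuals that
# would be unreadable tabulated as rows and columns of raw numbers.
rng = np.random.default_rng(0)
actual = rng.normal(100.0, 15.0, size=300)
predicted = actual + rng.normal(0.0, 5.0, size=300)

fig, ax = plt.subplots()
ax.scatter(actual, predicted, s=10, alpha=0.5)
ax.plot([60, 140], [60, 140], color="red", label="perfect prediction")
ax.set_xlabel("actual value")
ax.set_ylabel("predicted value")
ax.legend()
plt.show()
```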
I also mentioned earlier that each step of the process can introduce errors. Finite element modeling errors are quite distinct from numerical solution errors. Analogously, in the data world, the inaccuracies of a predictive model are conceptually separate from issues caused by improperly cleaned data. And in either scenario, not knowing what to look for in the visualizations can be its own source of error. To be effective, an analyst must understand and minimize all of those sources of inaccuracy.
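One way to keep those error sources separate in practice, sketched below under assumed names: a quick audit catches preparation errors before modeling begins, while cross-validation estimates the model’s own error on the data as given. The file name, column names, and model choice are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical prepared dataset; "target" is an assumed column name.
df = pd.read_csv("prepared_data.csv")

# Preparation error check: residual missing values are a data problem,
# not a modeling problem, and should be fixed upstream.
assert df.isna().sum().sum() == 0, "cleaning step left gaps in the data"

X, y = df.drop(columns=["target"]), df["target"]

# Modeling error check: cross-validation estimates how well the model
# itself generalizes, independent of how the data was cleaned.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"model R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```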
I mentioned that I see this three-step pattern everywhere, even in my kitchen: I prepare a meal, cook it, and then present it at the table. My silly obsession with the pattern aside, seeing data analytics through the lens of the same three-step process I was taught in school has helped me understand the role of each product in Altair’s data analytics offering much more clearly. I’d love to hear your thoughts on the data science process and how it can fit into your work.