Meet your New Best Friend, the Parallel Coordinate Plot
Parallel coordinate plots are an excellent way to explore high-dimensional data sets. These highly interactive plots can literally change the way you look at data.
Most real-world data sets greatly exceed the complexity of those used in learning exercises. This complexity comes in many forms, but one of the most basic is sheer size. Academic problems with only a handful of channels of information are a rare sight in the wild. Digging into data with even a dozen columns is a challenge, and finding the right tools for the task is critical. 2D scatter plots are limited to viewing two channels at a time and, speaking for myself, their higher-order cousins like 3D scatter and bubble plots are quite taxing on my brain. Over a decade ago I was introduced to parallel coordinate plots. To be honest? I didn’t like them, at all. In this post, I’d like to share why I eventually changed my mind.
These plots are unimpressive when all you are seeing is a static image. Ironically, this post uses static images to illustrate them, but I think they will be sufficient to do the plots justice. To begin, let us go over the basics of a parallel coordinate plot.
At first glance, this plot seems confusing and difficult to interpret. Breaking the dense presentation down into its pieces is the key to understanding its value, starting with how dimensions are represented. Every channel of the plot is represented as a vertical line. For example, IV1 is the first variable channel in the image. Located on the far left of the plot, this variable ranges from a minimum value of 30.48 at the bottom of the vertical axis to a maximum value of 88.56 at the top. The second channel, IV2, ranges from 60.96 to 179.04. This type of plot can work with as few as two dimensions, but it scales well to many dimensions, theoretically without limit! When used to explore data, a good implementation of a parallel coordinate plot should allow the user to interactively change the displayed channels. In the image above, this is controlled by the channel selector on the right-hand side. The image below shows the same data represented with only the 1st, 2nd, and 8th channels selected.
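To make the axis idea concrete, here is a minimal Python sketch of how a value might be placed on a channel's vertical axis using min-max scaling. The function name `axis_position` and the choice of scaling are assumptions made for illustration, not a description of any particular plotting tool; only the IV1 and IV2 ranges come from the plot described above.

```python
def axis_position(value, lo, hi):
    """Map a value to a height in [0, 1] on its axis (0 = bottom, 1 = top)."""
    if hi == lo:
        return 0.5  # constant channel: place every record mid-axis
    return (value - lo) / (hi - lo)


# Axis ranges quoted in the text for the first two channels.
ranges = {"IV1": (30.48, 88.56), "IV2": (60.96, 179.04)}

# Selecting channels amounts to choosing which axes to draw, and in what order.
selected = ["IV1", "IV2"]

for name in selected:
    lo, hi = ranges[name]
    midpoint = (lo + hi) / 2  # illustrative value, not taken from the data set
    print(name, axis_position(lo, lo, hi), axis_position(midpoint, lo, hi))
```

Because every axis is scaled independently, channels with very different units and ranges can sit side by side on the same plot.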
The next step is to explain the colorful lines. Each line represents one of the data records, and it crosses every vertical axis at the record’s value for that channel. In the image below, the highlighted black line has a high value of the variable IV1, a lower value of the variable IV2, and a high value of the variable KPI1.
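The way a record becomes a line can be sketched in a few lines of Python. This is only an illustration: the `polyline` helper, the sample record, and the KPI1 range are hypothetical, while the IV1 and IV2 ranges are the ones quoted earlier.

```python
def polyline(record, channels, ranges):
    """Return the (x, y) vertices of one record's line.

    x is the index of the axis, y is the record's scaled height on that axis.
    """
    points = []
    for x, name in enumerate(channels):
        lo, hi = ranges[name]
        y = (record[name] - lo) / (hi - lo) if hi > lo else 0.5
        points.append((x, y))
    return points


# KPI1's range is a made-up placeholder; the other two come from the text.
ranges = {"IV1": (30.48, 88.56), "IV2": (60.96, 179.04), "KPI1": (0.0, 100.0)}

# Hypothetical record: high IV1, low IV2, high KPI1.
record = {"IV1": 88.56, "IV2": 60.96, "KPI1": 90.0}

vertices = polyline(record, ["IV1", "IV2", "KPI1"], ranges)
print(vertices)
```

Drawing straight segments between consecutive vertices gives the zigzag line for that record; repeating this for every record produces the full plot.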
These images contain only 80 data records, yet the plot is already dense with many lines weaving together. Once again, interactive features such as line highlighting are vital for seeing how data snakes through the coordinate lines. But even highlighting has limits. The final interactive piece of a parallel coordinate implementation is the most useful for data exploration: filters. Placing filters simplifies the display so that you see only the records that match your requirements. For example, the image below shows a filter applied to the channel IV2. Notice that only a subset of the previous records remains visible after the masking filter is drawn with a mouse.
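Behind the mouse gesture, a filter of this kind is just a range test on one or more channels. The sketch below assumes records are stored as dictionaries; the `brush` function and the sample records are hypothetical and stand in for whatever your plotting tool does internally.

```python
def brush(records, filters):
    """Keep only the records that fall inside every (low, high) filter range."""
    return [
        record
        for record in records
        if all(lo <= record[channel] <= hi for channel, (lo, hi) in filters.items())
    ]


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "IV1": 35.0, "IV2": 170.0},
    {"id": 2, "IV1": 80.0, "IV2": 70.0},
    {"id": 3, "IV1": 60.0, "IV2": 150.0},
]

# A filter drawn over the upper part of the IV2 axis.
visible = brush(records, {"IV2": (140.0, 179.04)})
print([record["id"] for record in visible])
```

Filters on several channels combine naturally: a record stays visible only if it passes all of them, which is what makes stacking filters such a quick way to narrow down a data set.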
Now that the important features of the plot have been covered, let us return to the full 15-channel problem. Furthermore, let us assume this data represents a problem where low values of channel KPI8 are desirable.
Even at a quick glance, a few key takeaways about this dataset are immediately evident. First, the horizontal lines between KPI1 and KPI2 indicate a strong linear dependence; the channels are redundant, and it’s likely that one is a proportional multiple of the other. The filter also makes it clear that records with a low value of KPI8 have only higher values of IV2. This is indicative of a strong negative correlation between the two channels: as one goes up, the other goes down. Similarly, a positive correlation is evident between KPI8 and IV1.
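Once the plot suggests a relationship, it is easy to confirm it numerically. The sketch below computes the Pearson correlation coefficient by hand; the sample values are invented to mimic the patterns described above and are not the data behind the images.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: KPI2 is an exact multiple of KPI1,
# and IV2 falls as KPI8 rises.
kpi1 = [1.0, 2.0, 3.0, 4.0]
kpi2 = [2.5 * value for value in kpi1]
kpi8 = [10.0, 20.0, 30.0, 40.0]
iv2 = [170.0, 150.0, 110.0, 70.0]

print(pearson(kpi1, kpi2))  # close to +1 for proportional channels
print(pearson(kpi8, iv2))   # strongly negative
```

A coefficient near +1 or -1 backs up what the lines show, while a value near zero is a hint that the visual pattern may depend on the filter you have applied.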
This visual introduction to parallel coordinate plots will stop here, but many more use cases exist. I hope you are now equipped with the basics and can imagine how to use these plots in other cases, such as outlier detection or as a poor man’s way to identify optimality. There are many ways to represent data visually, but interactive parallel coordinate plots are an extremely scalable tool. In the comments section, let us know what other uses you can imagine for this type of visualization.