Clustering Versus Classification with shapeAI
Clustering and classification of geometric shapes are separate yet interrelated topics. Learn to use them effectively through a visual story.
In the months before our marriage, my fiancée and I went to purchase silverware for our soon-to-be household. I must admit, I found spending many hours comparing sets of forks, knives, and spoons to be pretty boring. After a while they all started to look the same to me. I can still recall all this years later, but I only recently realized I can use this experience to relay some key concepts of data science, namely clustering and classification, and more specifically how they are similar yet distinct.
Let’s start by imagining a pile of silverware with various forks, knives, and spoons; for example, see the image below from my kitchen.
Oddballs aside, anyone could sort these into three nice groups as shown here.
Visually, this makes sense and is self-evident to the human brain. For data science, the geometric shapes need to be represented with numeric machine learning features, for example with Altair’s shapeAI technology. From this encoded representation, a clustering algorithm can be used to sort the data into distinct clusters, conceptually represented here in a scatter plot with each of the three clusters assigned a color.
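To make the clustering step concrete, here is a minimal sketch using a hand-written k-means in plain Python. The two-dimensional feature values are invented for illustration and are not real shapeAI output, the number of clusters (three) is supplied by hand, and k-means is only one of many clustering algorithms that could be used here.

```python
import math

# Hypothetical 2-D features, one row per utensil (illustrative values only).
points = [
    (1.0, 1.2), (1.3, 0.9), (0.8, 1.1), (1.1, 1.4),  # one visual group
    (5.0, 5.2), (5.3, 4.8), (4.7, 5.1), (5.1, 5.5),  # a second group
    (9.0, 1.0), (9.2, 1.3), (8.8, 0.7), (9.1, 1.1),  # a third group
]

def farthest_point_init(data, k):
    """Pick k starting centers: the first point, then the point farthest from those chosen."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(data, k, iterations=20):
    """Return (cluster id per point, cluster centers)."""
    centers = farthest_point_init(data, k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignment = [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in data]
        # Update step: move each center to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, a in zip(data, assignment) if a == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centers[i])
        if updated == centers:  # nothing moved, so the clustering has converged
            break
        centers = updated
    return assignment, centers

cluster_ids, centers = kmeans(points, k=3)
print(cluster_ids)  # arbitrary integer ids, one per point
```

The ids that come back are just integers. Nothing in the output says which group holds the forks.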
Even though the clustering algorithm doesn’t know what each group represents, it recognizes that each item within a cluster has more similarity to other members of the same cluster and less in common with members of other clusters. In contrast, to have an algorithm actually aware of what each group represents, we must provide a class label for each data point and then use a classification algorithm. Where clustering only produces organized groupings, classification can be thought of as finding the boundaries between classes, as shown here.
Looking at the diagram, it is easy to imagine how the classifier will make a prediction on a new point: it simply checks which zone the new data point falls into, as shown with the red point in this image.
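The zone lookup can be sketched with a nearest-centroid classifier, one of the simplest classifiers there is. This is an assumption made for illustration, not necessarily the algorithm behind the diagram. The feature values, the labels attached to them, and the coordinates of the new point are all made up.

```python
import math
from collections import defaultdict

# Hypothetical labeled training data: (feature_1, feature_2) paired with a class label.
labeled = [
    ((1.0, 1.2), "fork"), ((1.3, 0.9), "fork"), ((0.8, 1.1), "fork"),
    ((5.0, 5.2), "knife"), ((5.3, 4.8), "knife"), ((4.7, 5.1), "knife"),
    ((9.0, 1.0), "spoon"), ((9.2, 1.3), "spoon"), ((8.8, 0.7), "spoon"),
]

def fit_centroids(samples):
    """Compute the mean feature vector of each class."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in by_class.items()
    }

def predict(centroids, point):
    """Return the label whose centroid is nearest to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

centroids = fit_centroids(labeled)
new_point = (4.2, 4.0)  # stands in for the red point
print(predict(centroids, new_point))  # knife
```

With this classifier, the boundary between two classes is the set of points equally distant from both centroids, so each zone is simply the region closest to one class centroid. Unlike the clustering output, the prediction is a meaningful name because labels were supplied up front.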
Sadly, at the time I didn’t have these mental tools to properly articulate my silverware shopping boredom. After too many hours of comparisons, all I could manage when asked for an opinion was, “I don’t know. Looks like a fork.” My mind had started to think in terms of classifications and all nuance was lost. The difference between two forks was too small compared to the differences between a fork and a knife, for example. Perhaps a better solution would have been to only show me forks, and not the knives or spoons. Revisiting our conceptual data set, if we considered only the forks, we could observe subtle differences between the data points, allowing distinctions between different types of forks, shown here side by side.
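The same effect can be reproduced in a small sketch. The data below is invented: two hypothetical kinds of fork (called dinner and salad here) that differ only slightly, plus knives and spoons. The k-means helper is a minimal illustrative implementation, and the cluster counts are chosen by hand.

```python
import math

def kmeans(data, k, iterations=20):
    """Minimal k-means with farthest-point initialization (illustrative only)."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(math.dist(p, c) for c in centers)))
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in data]
        # Move each center to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, a in zip(data, assignment) if a == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centers[i])
        if updated == centers:
            break
        centers = updated
    return assignment

# Hypothetical features: two subtly different kinds of fork, then knives and spoons.
dinner_forks = [(1.0, 1.0), (1.1, 1.2), (0.9, 1.1)]
salad_forks = [(1.8, 1.0), (1.9, 1.2), (1.7, 1.1)]
knives = [(5.0, 5.2), (5.3, 4.8), (4.7, 5.1)]
spoons = [(9.0, 1.0), (9.2, 1.3), (8.8, 0.7)]

forks = dinner_forks + salad_forks
everything = forks + knives + spoons

# Full set, three clusters: both kinds of fork land in the same cluster.
all_ids = kmeans(everything, k=3)
print(all_ids[:6])

# Forks only, two clusters: the subtle difference now separates the two kinds.
fork_ids = kmeans(forks, k=2)
print(fork_ids)
```

The forks themselves are identical in both runs. Only the set they are compared against has changed.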
Pondering this observation led me to realize that while classification labels are intrinsic to an individual item (“a fork is a fork”), the cluster to which an individual item belongs depends on the rest of the set (“all forks look similar when compared with knives and spoons”). In practice, this means we must be wary of assigning too much importance to the results of clustering unless we are confident the full dataset is a representative sample of what we are trying to learn. With this caveat in mind, I hope clusters can be used effectively in your next project to efficiently label data for supervised learning.
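As a rough sketch of that labeling workflow: once items are clustered, a person can name a single representative from each cluster and let every other member inherit that name. The cluster ids and hand-assigned labels below are hypothetical stand-ins for the output of a real clustering run.

```python
# Hypothetical clustering output: one cluster id per item, in item order.
cluster_ids = [0, 0, 0, 2, 2, 2, 1, 1, 1]

# A person inspects one representative from each cluster and names it.
# Keys are item indices, values are the hand-assigned class labels.
hand_labels = {0: "fork", 3: "knife", 6: "spoon"}

# Map each cluster id to the label of its inspected representative.
cluster_to_label = {cluster_ids[i]: label for i, label in hand_labels.items()}

# Propagate: every item inherits the label of its cluster.
labels = [cluster_to_label[c] for c in cluster_ids]
print(labels)
```

Three hand labels cover nine items here, but the propagated labels are only as trustworthy as the clusters themselves, so spot-checking a few members of each cluster before training on them is a sensible precaution.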