How to assess the "pureness" of clusters (e.g. k-means) with labeled data?
Hi,
I want to evaluate a clustering. I can assess the performance of the clusters with the "Map Clustering on Labels" operator, but this only works if the number of clusters is equal to the number of labels...
If I try different values of k with k-means, is there a way to assess the goodness of the clusters with some validity measure, like the pureness of a cluster (or the sum of all cluster purenesses), based on the label distribution within each cluster?
I could of course look at the distribution of labels in every cluster, but is there some overview that summarizes this pureness as a single performance value?
Best Answer
There are several operators related to cluster performance that you might examine: "Cluster Density Performance", "Cluster Distance Performance", "Cluster Count Performance", and "Item Distribution Performance". I'm not entirely sure from your question exactly what you are trying to measure, but those operators provide some built-in ways of assessing cluster performance with respect to a label. Item Distribution Performance, for instance, lets you look at the overall Gini coefficient based on your label across your clusters.
The other option I thought of would be to set your cluster output as the label, turn your original label into a regular attribute, and then build a simple model that uses only the original label as its input variable. Several modeling operators, such as the decision tree operator, will then give you measures of "pureness" that might be helpful for what you are trying to accomplish.
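If you export the cluster assignments together with the labels, the "pureness" described in the question can also be computed directly outside the process. The following is only a minimal Python sketch, not a RapidMiner operator: the function name `cluster_purity` and the toy data are made up for illustration, and it assumes purity is defined as the share of the most frequent label within each cluster.

```python
from collections import Counter, defaultdict

def cluster_purity(clusters, labels):
    """Return (per-cluster purity, overall purity).

    Hypothetical helper: purity of a cluster is taken to be the share of
    its most frequent label; overall purity is the size-weighted average.
    """
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    if len(clusters) == 0:
        raise ValueError("need at least one example")

    # Count how often each label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1

    # Share of the majority label per cluster.
    per_cluster = {
        cluster: max(c.values()) / sum(c.values())
        for cluster, c in counts.items()
    }
    # Majority-label counts summed over all clusters, divided by N.
    overall = sum(max(c.values()) for c in counts.values()) / len(labels)
    return per_cluster, overall

# Toy example (invented data): 3 clusters, 3 labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["a", "a", "b", "b", "b", "a", "c", "c"]
per_cluster, overall = cluster_purity(clusters, labels)
print(per_cluster)
print(overall)
```

One caveat when comparing different values of k: purity defined this way tends to rise as k grows and reaches 1.0 when every example sits in its own cluster, so it is best read alongside the number of clusters rather than on its own.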
I hope this helps!