Cluster Performance for DBSCAN and Agglomerative Clustering
Hello,
I want to try different clustering algorithms such as k-means, DBSCAN, and agglomerative clustering on my dataset and compare the results in order to select the "best" one. For validating centroid-based clustering I know there are the operators "Cluster Distance Performance" and "Cluster Density Performance". But what about performance evaluation for DBSCAN or agglomerative clustering? How can I do this?
Is there still something like the Global Silhouette Index, as used in "Rapid Miner - Data Mining Use Cases and Business Analytics Application", for this kind of problem?
Thanks for your help.
Answers
Good question. I don't know about the Global Silhouette Index, but in the meantime you do have a couple of other options. You could turn your clusters into labels and then try to predict them with simple classifiers such as Naive Bayes or Decision Trees; "best" in this case would presumably mean the clustering whose clusters are easiest to separate. Or, if you already have labels (separate from the clusters themselves), you could use "Map Clustering on Labels" and do something similar. Or run a predictive model using only the cluster attribute against your existing labels.
0 -
Thanks for your quick response.
Unfortunately I don't have any labels.
So your suggestion is to interpret the clusters as labels and then use, e.g., a Decision Tree with the clusters as the label attribute, right? But with this, how exactly can I see which one is the best cluster? I didn't get that yet.
0 -
Then I am not at all sure what you mean by "the best cluster" in this context. If you have some way of assigning values to individual clusters (e.g., you have some other label variable) then you can do what I suggested above. But if you don't have an external label, then you can only evaluate your clusters with respect to your (presumably many different) input attributes, which you can do by making your clusters the label and then looking for differences in the patterns of what distinguishes one cluster from the others. But I am not sure how you could decide which individual cluster was best under that kind of scenario because I don't know what it would mean for one cluster to be "better" than another. You could however evaluate different clustering methods as a whole against each other, by seeing which ones produce clusters that are most distinct (based on turning the clusters into labels and then evaluating the strength of the models used to predict the clusters).
0 -
Yes, sorry, the phrase "best cluster" in my post was wrong. I meant that I want to evaluate different clustering methods and compare them, but I don't yet understand how I can evaluate the strength of the models used to predict the clusters, e.g. with a Decision Tree as you suggested.
0 -
If you are using the clusters as labels, then once you build a few predictive models, you would simply use standard measures of model performance such as ROC AUC, accuracy, F1 score, etc. Take a look at the "Performance (classification)" operator for more details and many different performance measure options.
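For anyone who wants to try the clusters-as-labels idea outside of RapidMiner, here is a minimal sketch in Python with scikit-learn. It is not the RapidMiner process itself: the toy data, the parameter values (`eps`, `min_samples`, `n_clusters`, `max_depth`) and the helper name `separability` are all assumptions chosen for illustration. It clusters the same data three ways, treats each set of cluster ids as a label, and reports the cross-validated accuracy of a shallow decision tree at predicting those ids.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy, unlabeled data (assumption): three well-separated 2-D blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

# Illustrative settings only; tune these for real data.
clusterers = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}


def separability(X, labels, folds=5):
    """Mean cross-validated accuracy of a shallow tree predicting cluster ids."""
    mask = labels != -1  # DBSCAN marks noise points with -1; leave them out
    X_kept, y = X[mask], labels[mask]
    _, counts = np.unique(y, return_counts=True)
    if counts.size < 2 or counts.min() < folds:
        return float("nan")  # too few clusters or points to cross-validate
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    return float(cross_val_score(tree, X_kept, y, cv=folds,
                                 scoring="accuracy").mean())


scores = {name: separability(X, model.fit_predict(X))
          for name, model in clusterers.items()}

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

A higher accuracy means the clusters are easier to tell apart from the input attributes. Keep in mind that this favors clusterings a simple classifier can reproduce, and that dropping DBSCAN noise points makes its score look better than it would otherwise.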
0 -
Thanks @Telcontar120 - I was thinking along the same lines.
@hana1 you may also consider trying the Davies-Bouldin Index as implemented in the Cluster Distance Performance operator, as this appears to me (?) to accomplish a similar goal.
I don't know the Global Silhouette Index either...always something new to learn about!
Scott
0 -
But can I also use the Davies-Bouldin index for DBSCAN and agglomerative clustering? The documentation says that the distance performance is only for centroid-based clustering.
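As a point of comparison outside RapidMiner, here is a rough sketch with scikit-learn in Python. It says nothing about what the RapidMiner operators accept. In scikit-learn, `silhouette_score` and `davies_bouldin_score` only need the data and a flat cluster assignment; the Davies-Bouldin implementation derives a centroid for each cluster as the mean of its members, so it can be computed for DBSCAN or agglomerative output as well. Whether a centroid-based index is meaningful for non-convex DBSCAN clusters is a separate question. The toy data, the parameter values and the helper name `internal_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Toy, unlabeled data (assumption): three well-separated 2-D blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.5, random_state=0)

# Illustrative settings only; tune these for real data.
clusterers = {
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}


def internal_scores(X, labels):
    """Silhouette and Davies-Bouldin on non-noise points, plus noise share."""
    mask = labels != -1  # DBSCAN marks noise points with -1
    kept = labels[mask]
    if np.unique(kept).size < 2:
        return None  # both indices need at least two clusters
    return {
        "silhouette": float(silhouette_score(X[mask], kept)),          # higher is better
        "davies_bouldin": float(davies_bouldin_score(X[mask], kept)),  # lower is better
        "noise_fraction": float(1.0 - mask.mean()),
    }


results = {name: internal_scores(X, model.fit_predict(X))
           for name, model in clusterers.items()}

for name, result in results.items():
    print(name, result)
```

Reporting the noise fraction next to the two indices matters for DBSCAN, because excluding noise points from the calculation tends to flatter its scores.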