How to compare the performance of several clustering algorithms?
wessel
Dear All,
How to compare the performance of several clustering algorithms?
Weka provides a validation method called "classes to cluster evaluation".
This method basically does classification through clustering, which is nice when your dataset contains a "class" attribute.
But what if the datasets you benchmark on don't contain any nominal attributes?
To me, the natural solution is to measure missing-value replacement accuracy:
split the data into a training and a test set, remove a random attribute value from each sample in the test set, and then try to predict these removed values back.
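For concreteness, here is a minimal sketch of that evaluation, assuming all-numeric data and using scikit-learn's KMeans as a stand-in for the algorithm under test; the hidden value is predicted from the nearest centroid found on the remaining attributes:

```python
# Minimal sketch: missing-value replacement accuracy as a clustering benchmark.
# Assumptions: all-numeric data, KMeans as the algorithm under test.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))               # placeholder dataset

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
centroids = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train).cluster_centers_

errors = []
for x in X_test:
    j = rng.integers(X.shape[1])            # hide one random attribute value
    visible = np.arange(X.shape[1]) != j
    # assign the sample to the nearest centroid using only the visible attributes
    c = np.argmin(np.linalg.norm(centroids[:, visible] - x[visible], axis=1))
    # predict the hidden value as that centroid's value for the hidden attribute
    errors.append((centroids[c, j] - x[j]) ** 2)

print("missing-value replacement RMSE:", np.sqrt(np.mean(errors)))
```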
Does anyone know a paper which uses this approach?
Is there some other approach which is typically used?
Best regards,
Wessel
Tagged: AI Studio, Performance, Clustering
pablo_admig
Hi wessel,
Well, I'm facing a similar problem right now. I think we're pointing at the same thing, so I'll give you my opinion:
When I build two clustering models, I want to know which is the best, but that depends mainly on the business problem; i.e., sometimes you would use 3 or 4 clusters to explain the general behaviour of your clients to a Marketing Manager.
But if you pick one model, you like the segments, and you want to test its performance to be sure that the segmentation is representative for further clustering (i.e., clustering the data for the next month), then I think (and here is my answer/question for others) the validation would be to compare the distributions of each variable between the training and test data sets. If the distributions are similar for each variable, you can assert that the clustering model captures the pattern.
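A minimal sketch of that check, assuming numeric variables and using a two-sample Kolmogorov-Smirnov test as one possible similarity measure:

```python
# Sketch: compare each variable's distribution in the training vs. test set.
# Assumption: numeric variables; the KS test is one possible choice of measure.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(size=(700, 3))         # placeholder data
X_test = rng.normal(size=(300, 3))

for j in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_test[:, j])
    # a small p-value suggests the two distributions differ for this variable
    print(f"variable {j}: KS statistic={stat:.3f}, p-value={p:.3f}")
```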
We can do some testing and share the results,
Best regards,
Pablo.
wessel
Let's see if I understand correctly.
If an algorithm produces very similar distributions on the training and test sets, then the algorithm performs well?
But how does a clustering algorithm return distributions?
It returns clusters.
And the clusters on the training set will be different from the clusters on the test set.
Best regards,
Wessel
pablo_admig
Hi Wessel,
Well, I know that R/RM, for example, creates a new clustering model each time you run it. But I'm testing another piece of software, Powerhouse Analytics (1), and there, each time you create a clustering model, it stores the "weights" that produce the segmentation, so you can compare the training set vs. the test set.
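As an illustration of the same idea in other tools (a sketch with scikit-learn, not the software from the thread): a fitted KMeans model stores its centroids, so the same segmentation can be applied to new data instead of being re-learned:

```python
# Illustration: a fitted clustering model keeps its centroids ("weights"),
# so the same segmentation can score new data instead of being re-learned.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))         # placeholder data
X_test = rng.normal(size=(100, 2))          # e.g. next month's data

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
print("stored centroids:\n", model.cluster_centers_)

test_labels = model.predict(X_test)         # same model applied to new data
print("test-set cluster sizes:", np.bincount(test_labels, minlength=3))
```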
Now, with respect to distributions and comparisons between clusters, I was referring to the possibility of comparing the distribution of one variable between the training and test sets.
If the clustering model changes every time it is executed, what would be the point of testing it? What would the measures for comparing performance be?
I think comparing variable distributions (or interquartile ranges) between clusters would be one (of many) answers, e.g.:
Cluster 1 has an interquartile range between 20 and 30.
Cluster 2 has an interquartile range between 35 and 55.
Then I could say that the variable Age is very discriminative. On the other hand, if you have overlapping ranges, that variable is not very discriminative.
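A minimal sketch of that check, with a hypothetical Age variable and two clusters:

```python
# Sketch: per-cluster interquartile ranges of one variable; non-overlapping
# ranges suggest the variable discriminates the clusters well.
# Assumption: hypothetical "Age" values with known cluster labels.
import numpy as np

rng = np.random.default_rng(0)
age = np.concatenate([rng.normal(25, 3, 100), rng.normal(45, 5, 100)])
labels = np.repeat([0, 1], 100)

iqr = {}
for c in (0, 1):
    q1, q3 = np.percentile(age[labels == c], [25, 75])
    iqr[c] = (q1, q3)
    print(f"cluster {c}: IQR = [{q1:.1f}, {q3:.1f}]")

# the ranges overlap if the larger lower bound is below the smaller upper bound
overlap = max(iqr[0][0], iqr[1][0]) <= min(iqr[0][1], iqr[1][1])
print("Age is discriminative" if not overlap else "Age ranges overlap")
```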
Thank you,
Best regards,
Pablo.
(1) Unfortunately, this web page is in Spanish, but you can set the software language to English:
http://www.dataxplore.com.ar/tecnologia.php
wessel
It is not easy to understand what you are writing.
You can store cluster assignments, cluster models, and cluster weights in RapidMiner as well.
I think what you are suggesting about quartile ranges is related to these measures (see the sketch below):
- within-cluster similarity
- between-cluster distance
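A minimal sketch of those measures, assuming scikit-learn: the silhouette score combines within-cluster similarity and between-cluster separation, and the Davies-Bouldin index is a related ratio (lower is better):

```python
# Sketch: internal validity measures that combine within-cluster similarity
# and between-cluster distance; useful for comparing candidate clusterings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))               # placeholder data

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}, "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.3f}")
```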
Best regards,
Wessel