How to compare the performance of several clustering algorithms?

Dear All,

How to compare the performance of several clustering algorithms?
Weka provides a validation method called "classes to cluster evaluation".
This method basically does classification through clustering,
which is nice when your dataset contains a "class" attribute.

But what if the datasets on which you benchmark do not contain any nominal attributes?
To me the natural solution is then to measure missing-value replacement accuracy:
split the data into a training and a test set, remove a random attribute value from each sample in the test set,
and try to predict these removed attribute values back.
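
A minimal sketch of the idea above, in plain Python. It assumes each clustering model can be reduced to a list of numeric centroids; the names `nearest_centroid` and `imputation_error`, the centroids and the test samples are all made up for illustration. One attribute per test sample is hidden, the sample is assigned to the closest centroid using the remaining attributes, and the centroid's value is used as the prediction.

```python
import math
import random

def nearest_centroid(sample, centroids, skip):
    """Index of the closest centroid, ignoring the attribute at position `skip`."""
    def dist(c):
        return math.sqrt(sum((s - v) ** 2
                             for i, (s, v) in enumerate(zip(sample, c))
                             if i != skip))
    return min(range(len(centroids)), key=lambda k: dist(centroids[k]))

def imputation_error(test_set, centroids, seed=0):
    """Mean absolute error when one hidden attribute per sample is
    predicted from the nearest centroid."""
    rng = random.Random(seed)  # same seed -> same hidden attributes for every model
    errors = []
    for sample in test_set:
        hidden = rng.randrange(len(sample))           # attribute to remove
        k = nearest_centroid(sample, centroids, hidden)
        errors.append(abs(sample[hidden] - centroids[k][hidden]))
    return sum(errors) / len(errors)

# Hypothetical centroids, as if learned on the training set by two algorithms.
centroids_a = [(0.0, 0.0), (10.0, 10.0)]
centroids_b = [(5.0, 5.0)]

# Hypothetical test samples.
test_set = [(0.5, -0.5), (9.5, 10.5), (0.0, 1.0), (11.0, 10.0)]

error_a = imputation_error(test_set, centroids_a)
error_b = imputation_error(test_set, centroids_b)
print("model A:", error_a, "model B:", error_b)
```

A lower error would suggest that the clusters capture more of the structure in the data. Whether this is a fair benchmark for a given algorithm is exactly the open question here.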

Does anyone know a paper which uses this approach?
Is there some other approach which is typically used?

Best regards,

Wessel

pablo_admig

Hi wessel,

Well, I'm facing a similar problem right now. I think we're pointing at the same thing, so I'll give you my opinion:

When I build two clustering models, I want to know which one is better, but that depends mainly on the business problem; e.g., sometimes you only need 3 or 4 clusters to explain the general behaviour of your clients to a Marketing Manager.
But if you pick one model, you like the segments, and you want to test its performance to be sure that the segmentation is representative for further clustering (e.g., clustering the data for the next month), I think (and here is my answer/question for others) that the validation would be to compare the distribution of each variable between the training and the test data set. So, if the distributions are similar for each variable, you can assert that the clustering model captures the pattern.
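
One possible way to put a number on "the distributions are similar" is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical cumulative distribution functions. This is only a sketch of one option, not something the software mentioned in this thread is known to do; the function name `ks_statistic` and the Age values are invented for illustration. It could be applied per variable, and also per cluster.

```python
from bisect import bisect_right

def ks_statistic(train_values, test_values):
    """Largest gap between the empirical CDFs of the two samples
    (0 = identical distributions, 1 = completely separated)."""
    a, b = sorted(train_values), sorted(test_values)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # share of training values <= x
        cdf_b = bisect_right(b, x) / len(b)   # share of test values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical values of the variable Age in one cluster.
train_age = [22, 25, 27, 31, 35, 41, 44, 52]
test_similar = [23, 26, 28, 30, 36, 40, 45, 51]
test_shifted = [55, 58, 60, 63, 66, 70, 72, 75]

print(ks_statistic(train_age, test_similar))  # small gap
print(ks_statistic(train_age, test_shifted))  # large gap
```

A small value for every variable would support the claim that the segmentation still fits the new data; how small is "small enough" remains a judgment call.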

We can do some testing and share the results,

Best regards,
Pablo.

wessel

Let's see if I understand correctly.
If an algorithm returns a very similar distribution on the training and test set, then the algorithm performs well?

But how does a clustering algorithm return distributions?
It returns clusters.
And the clusters on the training set will be different from the clusters on the test set.

Best regards,

Wessel

pablo_admig

Hi Wessel,

Well, I know that R/RM, for example, creates a new clustering model each time you run it. However, I'm testing another piece of software, Powerhouse Analytics (1), and each time you create a clustering model in it, it stores the "weights" that produce the segmentation, so you can compare training set vs. test set.

Now, with respect to distributions and comparisons between clusters, I was referring to the possibility of comparing the distribution of one variable in the training and test sets.

If the clustering model changes every time it is executed, what would be the need for testing it? What would be the measures for comparing performance?
I think comparing variable distributions (or interquartile ranges) between clusters would be one (of many) answers, e.g.:

Cluster 1 has an interquartile range between 20 and 30.
Cluster 2 has an interquartile range between 35 and 55.

I could then say that the variable Age is very discriminative. On the other hand, if the ranges overlap, that variable is not very discriminative.
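
The interquartile-range check above can be sketched as follows. The helper names `iqr` and `ranges_overlap` and the Age values are hypothetical; the first two samples were chosen so that their ranges come out as 20 to 30 and 35 to 55, matching the example. The quartiles use the "inclusive" method of Python's `statistics.quantiles`, and other quartile definitions would give slightly different numbers.

```python
from statistics import quantiles

def iqr(values):
    """Return (Q1, Q3) of the values."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

def ranges_overlap(r1, r2):
    """True if the two closed intervals share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Hypothetical Age values per cluster.
age_cluster_1 = [15, 19, 21, 25, 29, 31, 40]
age_cluster_2 = [30, 34, 36, 45, 54, 56, 60]
age_cluster_3 = [22, 24, 26, 30, 34, 36, 45]

print(iqr(age_cluster_1), iqr(age_cluster_2))
print(ranges_overlap(iqr(age_cluster_1), iqr(age_cluster_2)))  # separated
print(ranges_overlap(iqr(age_cluster_1), iqr(age_cluster_3)))  # overlapping
```

In this made-up data, Age would separate clusters 1 and 2 but not clusters 1 and 3.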

Thank you,
Best regards,
Pablo.

(1) Unfortunately, this web page is in Spanish, but you can set the software language to English: http://www.dataxplore.com.ar/tecnologia.php

wessel

It is not easy to understand what you are writing.
You can also store cluster assignments, cluster models, and cluster weights in RapidMiner.

I think what you are suggesting about the quartile ranges is related to these measures:
- within cluster similarity
- between cluster distance
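
For concreteness, here is a rough sketch of these two measures, assuming numeric data and Euclidean distance. The function names and the two toy clusterings of the same four points are invented for illustration; RapidMiner and Weka ship their own variants of such measures, which may be defined differently.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def within_cluster_distance(clusters):
    """Average distance from each point to its own cluster centroid
    (lower means more compact clusters)."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) for p in points)
        count += len(points)
    return total / count

def between_cluster_distance(clusters):
    """Average pairwise distance between cluster centroids
    (higher means better separated clusters)."""
    cs = [centroid(points) for points in clusters]
    pairs = [(a, b) for i, a in enumerate(cs) for b in cs[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical clusterings of the same four points.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

for name, clusters in (("tight", tight), ("loose", loose)):
    print(name, within_cluster_distance(clusters), between_cluster_distance(clusters))
```

A clustering with a low within-cluster value and a high between-cluster value would usually be preferred, although both numbers depend on the number of clusters and on how the attributes are scaled.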

Best regards,

Wessel