"Miscellaneous Issues Related to Clustering"
Pinguicula
New Altair Community Member
Hi Everybody,
Since I'm neither a mathematician nor a computer scientist, the answers to the following questions might be quite simple, but I'm still a little bit confused about the clustering algorithms in RM:
A) Is it normal behaviour of the KMeans algorithm that it needs much more time (at least 10x) if the "add characterization" button is switched on?
B) Is DBScan the only density-based algorithm currently implemented in RM?
C) As far as I understand, the KMeans algorithm should be capable of producing clusters of very different cardinality. However, in my datasets the output clusters differ only slightly in their cardinality: the largest cluster is at most 5 or 6 times the size of the smallest one. Is this more likely a characteristic of the dataset or an artefact of the algorithm?
D) Using the ClusterCentroidEvaluator, the output indicates negative average distances. Is that possible, or should I just ignore the sign?
E) Are there performance vectors to evaluate the pairwise similarity / overlap between clusters produced by KMeans? Can I manipulate the output of KMeans in such a way that the ClusterDensityEvaluator and the ItemDistributionEvaluator accept it as input?
F) Is there any particular reason why the Ward method is not implemented as a clustering algorithm in hierarchical cluster models? (It is still used quite often in the publications in my discipline.)
Best
Norbert
Answers
Hi,
> Is it normal behaviour of the KMeans algorithm that it needs much more time (at least 10x) if the "add characterization" button is switched on?

Yes. Actually, the KMeans algorithm itself does not need more time - it takes exactly the same amount as without characterization - but the characterization itself needs it. For each resulting cluster, a prediction model is learned and the most discriminating features and thresholds are presented.
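If it helps to see the idea, here is a rough sketch of such a per-cluster characterization outside of RapidMiner, in Python with scikit-learn (the library, the toy data, and the depth-1 tree as a stand-in for a simple characterization learner are only my assumptions for illustration). It also shows why the extra time goes into the characterization step and not into the clustering itself:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# toy data and a plain KMeans run (the clustering itself)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# "characterization": one extra model per cluster, trained one-vs-rest
for c in np.unique(labels):
    y = (labels == c).astype(int)
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
    feature = stump.tree_.feature[0]      # most discriminating attribute
    threshold = stump.tree_.threshold[0]  # threshold of the split
    print(f"cluster {c}: attribute {feature}, threshold {threshold:.2f}")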
> Is DBScan the only density-based algorithm currently implemented in RM?

Very similar to DBScan, and also in some sense density-based, is the SupportVectorClustering (or SVClustering?).
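Just for illustration, a minimal DBScan-style run outside of RapidMiner, in Python with scikit-learn (library, data, and parameter values are my assumptions). It shows the density-based idea: clusters are dense regions, and low-density points end up as noise:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaved half-moons: density-based clustering separates them,
# centroid-based KMeans would not
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)
print("noise points:", list(db.labels_).count(-1))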
> As far as I understand, the KMeans algorithm should be capable of producing clusters of very different cardinality. However, in my datasets the output clusters differ only slightly in their cardinality: the largest cluster is at most 5 or 6 times the size of the smallest one. Is this more likely a characteristic of the dataset or an artefact of the algorithm?

There is actually no reason why clusters of different cardinality should be preferred. Imagine you have two groups (same cardinality) in a data space with only one dimension, where all points of one group have value -1 in this dimension and all points of the other have value 1. The inner-cluster distances are 0 for each group and the inter-cluster distance is 2. If k is set to 2, each group will of course form its own cluster - same cardinality.

Now let's say the first group has 99 points and the second one only 1 point. Again the result will be 2 clusters (cardinalities of 99 and 1, respectively) for the same reasons. So as you can see, there is no reason at all for the algorithm to prefer clusters with or without similar cardinalities.

I hope you get the point.
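Here is the 99-vs-1 example as a quick numeric sketch (outside of RapidMiner, Python with scikit-learn assumed), just to show that KMeans happily produces very unequal cardinalities when the data demands it:

import numpy as np
from sklearn.cluster import KMeans

# 99 points at -1 and a single point at +1 in one dimension
X = np.concatenate([np.full((99, 1), -1.0),
                    np.full((1, 1), 1.0)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # cluster sizes, e.g. [99  1] (order may vary)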
> Using the ClusterCentroidEvaluator, the output indicates negative average distances. Is that possible, or should I just ignore the sign?

Since all fitness criteria in RapidMiner are maximized and the distance should of course be minimized, we simply calculate -1 * distance in order to mirror the optimization direction.
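As a small sketch of what is being computed (again outside of RapidMiner, Python with scikit-learn assumed): the average distance of each example to its cluster centroid is a cost you want to be small, and negating it turns it into a fitness value where larger is better:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# distance of every point to the centroid of its own cluster
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
avg_distance = dists.mean()
print("average centroid distance:", avg_distance)
print("reported as fitness      :", -avg_distance)  # the negative number you see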
> Are there performance vectors to evaluate the pairwise similarity / overlap between clusters produced by KMeans? Can I manipulate the output of KMeans in such a way that the ClusterDensityEvaluator and the ItemDistributionEvaluator accept it as input?

I am not too deep into clustering myself, but since KMeans produces non-overlapping clusters I would not really know how to calculate overlap. About the pairwise similarity: as far as I remember, it was shown by Hastie / Tibshirani and / or Friedman that optimizing the pairwise similarity between the clusters is equivalent to optimizing the average distances to the cluster centroids. And this should be available.

About the other two questions: I have no idea right now. Anyone else?
> Is there any particular reason why the Ward method is not implemented as a clustering algorithm in hierarchical cluster models? (It is still used quite often in the publications in my discipline.)

Very simple: no one has implemented it yet / asked for it yet / paid for it yet / ... So you are the first.
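Until someone implements it, a possible workaround is to run Ward's linkage outside of RapidMiner; a minimal sketch in Python with scikit-learn (library assumed, data just for illustration):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# hierarchical (agglomerative) clustering with Ward's minimum-variance criterion
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(ward.labels_[:10])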
Cheers,
Ingo
Hi everybody,
Thank you, Ingo, for your informative reply.
Just some add-ons for clarification.
> Yes. Actually, the KMeans algorithm itself does not need more time - it takes exactly the same amount as without characterization - but the characterization itself needs it. For each resulting cluster, a prediction model is learned and the most discriminating features and thresholds are presented.

Does this also mean that if the output doesn't show any characteristics for some clusters (which happened to me), it cannot find any good discriminating features and thresholds?
Best
Norbert
Hi,
> Does this also mean that if the output doesn't show any characteristics for some clusters (which happened to me), it cannot find any good discriminating features and thresholds?

No, not necessarily. Only the very "primitive" OneR learner is used for characterization. If one attribute alone is not enough, no good characterization is given. Nevertheless, the clustering still might be valid and useful. If you want to be sure, you could skip the characterization, change the attribute role from "cluster" to "label" after clustering, and learn a more sophisticated model.
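A rough sketch of that last workaround outside of RapidMiner (Python with scikit-learn assumed; the depth-3 tree is just one example of a "more sophisticated" learner than a single-attribute rule):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# treat the cluster assignment as the label and learn a richer model on it
model = DecisionTreeClassifier(max_depth=3).fit(X, cluster_labels)
print(export_text(model))   # multi-attribute rules describing each cluster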
Cheers,
Ingo