"Meaning of Cluster Centroid Values"
Thieme
New Altair Community Member
Hello,
I'd like to know more about the typical range and "best values" of the Cluster Centroid Evaluator output when it is applied to KMeans results. I'm trying to cluster texts and don't know which k would be best for KMeans.
Are there any papers available on this?
Is the range between 0 and 1? What does a value of 0.5 mean?
Why are the values in the log operator, for example, 0.3 while the performance vector shows 0.03?
What is the difference between a value of 0.35 and 0.37? Is it a meaningful difference?
............
I would also like to know more about ExampleDistribution and ClusterSimilarity in the same way.
Thanks for any hints about it,
Thieme
Example:
Learner   k     Example distribution          Avg. within centroid distance   Avg. within cluster similarity
                (ItemDistributionEvaluator)   (ClusterCentroidEvaluator)      (ClusterDensityEvaluator)
Kmeans    25    0.607714438549061             0.0240498824527269              9.29991686777007
Kmeans    49    0.285604822969826             0.0265081843979856              7.31511727219542
Kmeans    97    0.0940161054807787            0.0270637567159598              5.42341328360529
Kmeans    194   0.0374300626002978            0.0285037289731846              4.4437852792439
Kmedoid   25    0.0620075467333632            0.0325445076060151              6.52771248771305
Kmedoid   49    0.0415070334976651            0.0289862229025727              5.87483323589922
Kmedoid   97    0.0283797655018867            0.031954848998016               5.26930481169764
Kmedoid   194   0.0203703971432289            0.0310514803916727              4.6688808436615
Answers
-
Hi,
before I can answer any of your questions about the differences in the output, you have to provide a process I can run. So replace any example source with an ExampleSetGenerator and post the process here. I will then come back and try to explain why this happens.
I guess there are many papers out there, but since I don't know a specific one, I would recommend searching Google Scholar. It will probably help you find a suitable paper, or at least a paper with references to suitable papers.
Greetings,
Sebastian
-
Hi again,
here is an example process. My main questions concerning the results are listed below it:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="Q:\DPP-02 Wissensbasierte Instandhaltung\8_Datenanalyse\RAPIDMINER Ressourcen\Versuchsergebnisse\AKTUELLER_DURCHLAUF\LOGGING.log"/>
<operator name="ExampleSetGenerator" class="ExampleSetGenerator" activated="no">
<parameter key="target_function" value="random"/>
<parameter key="number_examples" value="12"/>
<parameter key="number_of_attributes" value="2"/>
<parameter key="attributes_lower_bound" value="0.0"/>
<parameter key="attributes_upper_bound" value="1.0"/>
</operator>
<operator name="ExcelExampleSource" class="ExcelExampleSource">
<parameter key="excel_file" value="Q:\DPP-02 Wissensbasierte Instandhaltung\8_Datenanalyse\RAPIDMINER Ressourcen\INPRO Samples\DoppelAttribute.xls"/>
<parameter key="first_row_as_names" value="true"/>
<parameter key="id_column" value="2"/>
</operator>
<operator name="IOStorer _ExampleSet(2)" class="IOStorer">
<parameter key="name" value="ExampleSet_zwischenspeichern"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="remove_from_process" value="false"/>
</operator>
<operator name="GridParameterOptimization" class="GridParameterOptimization" expanded="yes">
<list key="parameters">
<parameter key="KMeans_Learner (2).k" value="2,3,4,6,10"/>
</list>
<operator name="KMeans_Learner (2)" class="KMeans" breakpoints="before">
<parameter key="k" value="3"/>
</operator>
<operator name="IOStorer_LastClusterModel" class="IOStorer" activated="no">
<parameter key="name" value="ClusterModel_zwischenspeichern"/>
<parameter key="io_object" value="ClusterModel"/>
<parameter key="remove_from_process" value="false"/>
</operator>
<operator name="ItemDistributionEvaluator (2)" class="ItemDistributionEvaluator" activated="no">
<parameter key="measure" value="SumOfSquares"/>
</operator>
<operator name="ClusterCentroidEvaluator_inner" class="ClusterCentroidEvaluator">
<parameter key="keep_example_set" value="true"/>
<parameter key="normalize" value="true"/>
</operator>
<operator name="Log" class="ProcessLog">
<parameter key="filename" value="Q:\DPP-02 Wissensbasierte Instandhaltung\8_Datenanalyse\RAPIDMINER Ressourcen\Versuchsergebnisse\AKTUELLER_DURCHLAUF\ParameterOptimierung.log"/>
<list key="log">
<parameter key="k von KMeans" value="operator.KMeans_Learner (2).parameter.k"/>
<parameter key="ClusterCentroid_averageDensity" value="operator.ClusterCentroidEvaluator_inner.value.avg_within_distance"/>
<parameter key="ClusterCentroid_DaviesBouldin" value="operator.ClusterCentroidEvaluator_inner.value.null"/>
</list>
<parameter key="persistent" value="true"/>
</operator>
</operator>
<operator name="ParameterSetWriter" class="ParameterSetWriter">
<parameter key="parameter_file" value="Q:\DPP-02 Wissensbasierte Instandhaltung\8_Datenanalyse\RAPIDMINER Ressourcen\Versuchsergebnisse\AKTUELLER_DURCHLAUF\ParameterSet.par"/>
</operator>
<operator name="ParameterSetLoader" class="ParameterSetLoader">
<parameter key="parameter_file" value="Q:\DPP-02 Wissensbasierte Instandhaltung\8_Datenanalyse\RAPIDMINER Ressourcen\Versuchsergebnisse\AKTUELLER_DURCHLAUF\ParameterSet.par"/>
</operator>
<operator name="ParameterSetter" class="ParameterSetter">
<list key="name_map">
<parameter key="KMeans_Learner" value="KMeans_OptimalLearner"/>
</list>
</operator>
<operator name="IORetriever _ExampleSet(2)" class="IORetriever">
<parameter key="name" value="ExampleSet_zwischenspeichern"/>
<parameter key="io_object" value="ExampleSet"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator name="KMeans_OptimalLearner (3)" class="KMeans">
</operator>
<operator name="IORetriever_ClusterModel" class="IORetriever" activated="no">
<parameter key="name" value="ClusterModel_zwischenspeichern"/>
<parameter key="io_object" value="ClusterModel"/>
<parameter key="remove_from_store" value="false"/>
</operator>
<operator name="ClusterCentroidEvaluator_outer (2)" class="ClusterCentroidEvaluator" activated="no">
<parameter key="keep_example_set" value="true"/>
<parameter key="normalize" value="true"/>
</operator>
</operator>
- What does the value of 0.167 for k=2 mean?
- The optimal k=3 (because the distance is 0!?) is not used in the "outer" KMeans operator. Where is my mistake?
Thieme
-
Hi,
the value is the average within-cluster distance, i.e. the average of the distances between all members of a cluster.
Regarding your second question: you have a typo in the ParameterSetter operator's parameter. The final KMeans operator is named "KMeans_OptimalLearner (3)", but you wrote "KMeans_OptimalLearner". This probably happened because you copied the operator several times after it first worked.
Greetings,
Sebastian
-
Hi Sebastian,
I expected a value of (nearly) 0.25, because in one cluster the average is 0 and in the other cluster the average is 0.5 (3 examples have a distance of 0 to the centroid and 4 examples have a distance of 1 to the centroid). Hm.
Thanks for the hint concerning the operator names, now it works!
Greetings,
Thieme
-
Hi,
I don't think that is possible: the centroid of a KMeans cluster cannot coincide exactly with three points while another member lies some distance away. The centroid is always the mean vector of all cluster members, so this can never be the case.
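To illustrate with made-up numbers (the four points below are hypothetical, not taken from your data): if three members of a cluster coincide and a fourth lies a distance of 1 away, the mean vector lands between them, so no member has a distance of 0 to the centroid. A minimal Python sketch of that reasoning:

```python
# Hypothetical 2-D cluster: three identical points and one point at distance 1.
points = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]


def centroid(members):
    """Return the mean vector of the cluster members."""
    n = len(members)
    dims = len(members[0])
    return tuple(sum(p[d] for p in members) / n for d in range(dims))


def euclidean(a, b):
    """Plain (non-squared) Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


c = centroid(points)
distances = [euclidean(p, c) for p in points]
print(c)          # (0.25, 0.0)
print(distances)  # [0.25, 0.25, 0.25, 0.75]
```

The centroid is pulled a quarter of the way towards the outlying point, so every member ends up with a non-zero distance.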
Greetings,
Sebastian
-
Hi Sebastian,
thanks for the hint that the centroid is the mean vector. We (my colleague and I) also used KMedoid, and in both cases we get values that we don't expect.
Could it be that multiplication is used instead of addition when the average is built? Multiplication generates the results we get. In the Java file addition is used, but perhaps this readable Java file is not the one used during execution?
Greetings,
Thieme and AKeane
-
Hi,
I think the solution is simple: the distance is measured as squared Euclidean distance.
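As a rough illustration with made-up one-dimensional data (the values and the helper name `avg_within_centroid_distance` are hypothetical, and averaging over all examples is only an assumption about how the evaluator aggregates), squaring shrinks every distance below 1, which would look like multiplication instead of addition:

```python
# Hypothetical clusters on a line; not the data from this thread.
clusters = [[0.0, 0.0, 0.0], [0.0, 1.0]]


def avg_within_centroid_distance(clusters, squared):
    """Average distance of every example to its own cluster centroid.

    Assumption: the average is taken over all examples, not per cluster.
    """
    total, count = 0.0, 0
    for members in clusters:
        centre = sum(members) / len(members)  # centroid = mean of the members
        for x in members:
            d = abs(x - centre)
            total += d * d if squared else d
            count += 1
    return total / count


print(avg_within_centroid_distance(clusters, squared=False))  # 0.2
print(avg_within_centroid_distance(clusters, squared=True))   # 0.1
```

In this toy case the two members of the second cluster each lie 0.5 from their centroid; squared, that becomes 0.25, so the overall average halves.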
Greetings,
Sebastian
-
Thank you for your support!
Now I can interpret the values of the cluster centroid evaluator using the log operator!
Greetings,
Thieme