"Cluster Number sorted after saving to a file"
vijaypshah
New Altair Community Member
Hello,
I am using k-means clustering. After running the whole process, I write the results to a file. However, the cluster numbering changes once the results are written. It seems that the clusters are renumbered in order of their mean values.
For example, the centroid table showed that cluster 13 had mean values of 100 200 200 200 200. However, when I loaded the saved result file into other software to compute the means for cluster 13, they were different. I then saw that cluster 13 had been renamed to cluster 0 in the saved file (and the other cluster numbers had changed as well).
Is it true that the numbering is sorted by cluster mean value? I can send you the data file if you want to try this with the same dataset.
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\sources.aml"/>
        <parameter key="column_separators" value=";"/>
    </operator>
    <operator name="KMeans" class="KMeans">
        <parameter key="k" value="15"/>
        <parameter key="max_runs" value="1000"/>
        <parameter key="max_optimization_steps" value="10000"/>
    </operator>
    <operator name="ResultWriter" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_resultstat.res"/>
    </operator>
    <operator name="ItemDistributionEvaluator" class="ItemDistributionEvaluator">
        <parameter key="measure" value="SumOfSquares"/>
    </operator>
    <operator name="ClusterNumberEvaluator" class="ClusterNumberEvaluator">
    </operator>
    <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
        <parameter key="name" value="cluster"/>
    </operator>
    <operator name="Nominal2Numerical" class="Nominal2Numerical">
    </operator>
    <operator name="DataStatistics" class="DataStatistics">
    </operator>
    <operator name="ResultWriter (2)" class="ResultWriter">
        <parameter key="result_file" value="C:\cluster15_em_stat.res"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="C:\cluster15_em.dat"/>
        <parameter key="attribute_description_file" value="C:\cluster15_em.aml"/>
        <parameter key="format" value="special_format"/>
        <parameter key="special_format" value="$v[cluster]"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
    <operator name="ClusterModelWriter" class="ClusterModelWriter">
        <parameter key="cluster_model_file" value="C:\cluster15_em.clm"/>
    </operator>
</operator>
Answers
Hi,
Which of the result files did you look at? The one written by the ResultWriter?
Greetings,
Sebastian
Yes, the file written by result writer.
However, I think I now understand the problem. The cluster numbers are stored as nominal values such as "cluster_0", "cluster_1", etc., so the ResultWriter reports cluster_0 as cluster 0 and so on. But when I apply the Nominal2Numerical filter, the cluster numbers may change, i.e. cluster_0 might become 1 and cluster_1 might become 0.
So, just to be safe, I recalculate the means from the attributes in another program, where I use the numeric cluster numbers.
This is possibly a flaw in the way I designed the process.
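A minimal Python sketch of how this kind of mismatch can arise. It assumes, purely for illustration, that nominal values get integer indices in order of first appearance in the data; this is not confirmed to be how Nominal2Numerical works, and the helper names `index_by_first_appearance` and `cluster_number` are hypothetical.

```python
def index_by_first_appearance(labels):
    """Map each nominal value to an integer in order of first appearance.

    Assumption: this only imitates one plausible nominal-to-numeric scheme.
    """
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


def cluster_number(label):
    """Parse the number out of a label such as 'cluster_13'."""
    return int(label.rsplit("_", 1)[1])


# Made-up cluster assignments; cluster_13 happens to occur first.
labels = ["cluster_13", "cluster_0", "cluster_13", "cluster_2", "cluster_0"]

mapping = index_by_first_appearance(labels)
for label, index in mapping.items():
    # The internal index and the number in the label need not agree.
    print(label, "index:", index, "parsed number:", cluster_number(label))
```

Under this assumption cluster_13 ends up with index 0, which would look like a renumbering. Parsing the number out of the label keeps it consistent with the centroid table.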
Regards,
Vijay