"Using a Kmeans model (clm file) created w/ a sample to cluster my population"
I created a k-means model with 7 clusters using a sample of 10K records. I now want to cluster my whole population (~1 million records). I am using the ClusterModelReader operator to read the cluster model I created from the sample (clm file) and the ClusterModel2ExampleSet operator to cluster the entire population.
Your description for this operator states "This Operator clusters an exampleset given a cluster model. If an exampleSet does not contain id attributes it is probably not the same as the cluster model has been created on. Since cluster models depend on a static nature of the id attributes, the outcome on another exampleset with different values but same ids will be unpredictable.". Does this mean that it will only cluster the records that I used to create the model, and will not do any new records?
The process below finishes without errors, but it only clusters the records that were in the sample file. All other records have a blank cluster number in the output file.
Is there a way to use the model I created to cluster new records, or do I have to run the k-means algorithm on the 1 million record file and not use the clm file created from the sample data?
Thanks in advance.
Keith
<operator name="ClusterModelReader" class="ClusterModelReader">
  <description text="The cluster model 8051_Lifestyle_Matches_Excel.clm is the exact model I used for the Excel study so use it to cluster the population"/>
  <parameter key="cluster_model_file" value="C:\Documents and Settings\krobinson\My Documents\rm_workspace\Clustering\8051_Lifestyle_Matches_Excel.clm"/>
</operator>
<operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
  <operator name="PSVExampleSetWriter" class="CSVExampleSetWriter">
    <parameter key="column_separator" value="|"/>
    <parameter key="csv_file" value="C:\Documents and Settings\krobinson\My Documents\rm_workspace\Clustering\8051_Population_Lifestyle.psv"/>
  </operator>
</operator>
Hi,
here's the official side:
I was **** sure that I had implemented this feature for centroid cluster models. Maybe it got lost during the restructuring of the model class structure in the new cluster plugin.
I will add it to the developer branch in the next few days and get back to you when I'm finished.
Greetings,
Sebastian
Hi,
I have a problem with clustering too. With my preprocessed source I do:
...
<operator name="Datenset Clustern" class="OperatorChain" expanded="yes">
  <operator name="KMeans" class="KMeans">
  </operator>
  <operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
  </operator>
  <operator name="LinearRegression" class="LinearRegression">
    <parameter key="keep_example_set" value="true"/>
  </operator>
</operator>
I expected at least two models as output, but I get only one, with a cluster column. At least I can't see the two example sets, and I found no operator that splits the ExampleSet according to an attribute value.
What I basically want to do is apply an operator, e.g. LinearRegression, to each of the clusters individually.
Thx in advance.
Markus
Hi Markus,
this is possible using a combination of iterators and filtering:
<operator name="Root" class="Process" expanded="yes">
  <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
    <parameter key="number_of_attributes" value="2"/>
    <parameter key="target_function" value="global and local models classification"/>
  </operator>
  <operator name="Datenset Clustern" class="OperatorChain" expanded="yes">
    <operator name="KMeans" class="KMeans">
    </operator>
    <operator name="ValueIterator" class="ValueIterator" expanded="yes">
      <parameter key="attribute" value="cluster"/>
      <operator name="ExampleFilter" class="ExampleFilter">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="cluster = %{loop_value}"/>
      </operator>
      <operator name="LinearRegression" class="LinearRegression" breakpoints="after">
        <parameter key="keep_example_set" value="true"/>
      </operator>
      <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="C:\Dokumente und Einstellungen\sland\Eigene Dateien\yale\workspace\ClusteredLR - %{loop_value}.mod"/>
      </operator>
    </operator>
  </operator>
</operator>
Greetings,
Sebastian
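For readers who want to see the iterate-and-filter idea outside RapidMiner, here is a minimal sketch in plain Python. It is not what the ValueIterator, ExampleFilter, or LinearRegression operators do internally; it only illustrates the same pattern: group the rows by their cluster value and fit one model per group. The function names `fit_line` and `fit_per_cluster`, the single-predictor least-squares fit, and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Assumption: the x values inside a cluster are not all identical,
    # otherwise sxx is zero and the slope is undefined.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_per_cluster(rows):
    """rows: iterable of (cluster, x, y). Returns {cluster: (intercept, slope)}."""
    groups = defaultdict(list)
    for cluster, x, y in rows:
        # Same role as filtering on "cluster = <loop value>".
        groups[cluster].append((x, y))
    return {cluster: fit_line(points) for cluster, points in groups.items()}

# Toy data (hypothetical): two clusters with different linear trends.
rows = [
    ("cluster_0", 0.0, 1.0), ("cluster_0", 1.0, 3.0), ("cluster_0", 2.0, 5.0),
    ("cluster_1", 0.0, 10.0), ("cluster_1", 1.0, 9.0), ("cluster_1", 2.0, 8.0),
]
models = fit_per_cluster(rows)
for cluster, (intercept, slope) in sorted(models.items()):
    print(cluster, intercept, slope)
```

Each cluster ends up with its own coefficients, which corresponds to writing one model file per loop value in the process above.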
I wanted to try something similar this weekend ... and ...
I checked the code, and indeed, K-Means and cluster models in general are not easily applicable (as far as I can see) to new data.
There are clustering algorithms that have no predefined way of clustering a new item, DBScan for instance (and so they are "static"). But a centroid-oriented algorithm like k-means has a clustering strategy that implies how to assign a cluster to a new item ...
Well, let's wait for a statement from the official side.
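To make the point about centroid-oriented algorithms concrete, here is a minimal sketch in plain Python of how a new item can be assigned to a cluster once the centroids are known: pick the nearest centroid. This is not RapidMiner code and makes no claim about how ClusterModel2ExampleSet works. The centroids, the records, and the function names `nearest_centroid` and `assign_clusters` are made up for illustration, and Euclidean distance is assumed.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point."""
    best_index = 0
    best_dist = squared_distance(point, centroids[0])
    for i in range(1, len(centroids)):
        d = squared_distance(point, centroids[i])
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

def assign_clusters(records, centroids):
    """Assign every record to its nearest centroid, seen before or not."""
    return [nearest_centroid(r, centroids) for r in records]

# Hypothetical centroids, standing in for a model learned on a sample.
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical new records that were not part of that sample.
population = [(1.0, 2.0), (9.0, 8.5), (-3.0, 0.5)]

labels = assign_clusters(population, centroids)
print(labels)  # [0, 1, 0]
```

Nothing here depends on record ids, only on attribute values, which is why a centroid model could in principle label records it has never seen. The new data would need the same attributes and the same preprocessing (for example, scaling) as the sample.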