K-Means and Optimizing K

hgwelec New Altair Community Member
edited November 5 in Community Q&A
Dear All,

I looked through the example setups but could not find anything similar.

I am trying to figure out how to optimize K-Means (find the optimal number of clusters k) through cross-validation. I tried using an XValidation operator, but I cannot get it to work. Here is my setup, which I would like to change:

<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename" value="/data-binary.csv"/>
        <parameter key="label_name" value="class"/>
    </operator>
    <operator name="CorrelationMatrix" class="CorrelationMatrix">
    </operator>
    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="12"/>
            <parameter key="max_runs" value="50"/>
            <parameter key="max_optimization_steps" value="500"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="local_random_seed" value="8"/>
        </operator>
        <operator name="ClusterModelWriter" class="ClusterModelWriter">
            <parameter key="cluster_model_file" value="/models/clusterout.clm"/>
        </operator>
        <operator name="ClusterCentroidEvaluator" class="ClusterCentroidEvaluator">
            <parameter key="keep_example_set" value="true"/>
        </operator>
    </operator>
    <operator name="ClusterModelReader" class="ClusterModelReader">
        <parameter key="cluster_model_file" value="/models/clusterout.clm"/>
    </operator>
</operator>

Could someone please help?


Answers

  • land
    land New Altair Community Member
    Hi,
    The problem is that unsupervised learning does not really allow performance estimation. That is why it is called unsupervised: we simply do not know the true solution. So we cannot compare one clustering to another and say that one is the true one and the other is rubbish.
    That's why you are running into problems.
    There are, however, some measures that serve as heuristics for the goodness of a clustering, but keep in mind that heuristics may lead to non-optimal solutions. You can plug these heuristics in the same way you plug in performance evaluators for regression and classification.
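
To make the idea of a clustering heuristic concrete, here is a minimal Python sketch. It is not RapidMiner's implementation: the function names and the toy points are invented for illustration, and plain Euclidean distance is assumed (RapidMiner's own operator may define or sign its values differently). It computes two common heuristics, the average within-centroid distance and the Davies-Bouldin index, for two hand-made clusterings of the same points. Lower is better for both.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def avg_within_centroid_distance(clusters):
    """Mean Euclidean distance from each point to its own cluster centroid."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(p, c) for p in pts)
        count += len(pts)
    return total / count

def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    cents = [centroid(pts) for pts in clusters]
    scatter = [sum(math.dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    worst = []
    for i in range(len(clusters)):
        worst.append(max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                         for j in range(len(clusters)) if j != i))
    return sum(worst) / len(worst)

# Toy data (hypothetical): two tight groups of four points each.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]

good = [group_a, group_b]                                  # matches the groups
bad = [group_a[:2] + group_b[:2], group_a[2:] + group_b[2:]]  # mixes the groups

good_avg = avg_within_centroid_distance(good)
bad_avg = avg_within_centroid_distance(bad)
good_db = davies_bouldin(good)
bad_db = davies_bouldin(bad)

print("avg within-centroid distance:", good_avg, "vs", bad_avg)
print("Davies-Bouldin index:", good_db, "vs", bad_db)
```

Both measures prefer the clustering that matches the two groups here, but neither knows the "true" answer. They only reward compact, well-separated clusters, which is why they remain heuristics.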

    Here's a small example for RapidMiner 5.0 that shows how this works, and also that heuristics may fail:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="296" width="1018">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="three ring clusters"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="optimize_parameters_grid" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="204" y="27">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;12;13;linear]"/>
            </list>
            <process expanded="true" height="610" width="1073">
              <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="30">
                <description>A cross-validation evaluating a k-means clustering model.</description>
                <process expanded="true">
                  <operator activated="true" class="k_means" expanded="true" name="Clustering">
                    <parameter key="k" value="10"/>
                  </operator>
                  <connect from_port="training" to_op="Clustering" to_port="example set"/>
                  <connect from_op="Clustering" from_port="cluster model" to_port="model"/>
                  <connect from_op="Clustering" from_port="clustered set" to_port="through 1"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                  <portSpacing port="sink_through 2" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="cluster_distance_performance" expanded="true" name="Performance"/>
                  <connect from_port="model" to_op="Performance" to_port="cluster model"/>
                  <connect from_port="test set" to_op="Performance" to_port="example set"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="source_through 2" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="input 1" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 2"/>
          <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
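
For readers without RapidMiner at hand, the logic of the process above can be roughly sketched in plain Python. This is an illustration under assumptions, not a translation of the operators: the data are three Gaussian blobs rather than the "three ring clusters" generator, the helper names (`kmeans`, `cross_validated_score`, `make_blobs`) are made up, 3 folds and 3 restarts are chosen only to keep the run short, and the score is the average Euclidean distance of held-out points to their nearest centroid.

```python
import math
import random

def kmeans(points, k, rng, max_steps=30):
    """Lloyd's algorithm; returns the centroids at convergence or after max_steps."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_steps):
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # assignments stable: converged
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def avg_distance_to_nearest(points, centroids):
    """Heuristic score: mean distance of each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def cross_validated_score(points, k, folds, rng, restarts=3):
    """Fit on the training folds, score the held-out fold, average over folds."""
    scores = []
    for f in range(folds):
        test = points[f::folds]
        train = [p for i, p in enumerate(points) if i % folds != f]
        # Keep the restart with the lowest training score.
        best = min((kmeans(train, k, rng) for _ in range(restarts)),
                   key=lambda cents: avg_distance_to_nearest(train, cents))
        scores.append(avg_distance_to_nearest(test, best))
    return sum(scores) / folds

def make_blobs(rng, per_blob=50):
    """Hypothetical stand-in data: three well-separated 2-D Gaussian blobs."""
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
    pts = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
           for cx, cy in centers for _ in range(per_blob)]
    rng.shuffle(pts)
    return pts

rng = random.Random(8)
data = make_blobs(rng)

# Grid over k = 2..12, as in the parameter optimization above.
scores = {k: cross_validated_score(data, k, folds=3, rng=rng) for k in range(2, 13)}
best_k = min(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("k chosen by the grid search:", best_k)
```

On data like this, the held-out distance typically keeps shrinking as more centroids are added, so a search that minimizes it tends to pick a k larger than the three groups actually present. That is one way the heuristic can fail, and why the printed scores are worth inspecting (for example, looking for the k where the improvement flattens) rather than trusting the minimum blindly.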

    Greetings,
      Sebastian