"Placing new instances in clusters using a cluster model"

Hi guys
Besides clustering a dataset, RapidMiner can produce cluster models, store them in repositories, and write them to files. But how can an already built cluster model be applied to a compatible but distinct dataset? I presume this is possible, since cluster models exist as separate objects. For instance, if one wanted to place each new instance in the appropriate cluster, how could this be done in a process? Cheers!


IngoRM

Hi,

I am not sure whether all cluster models support this, but at least for the centroid-based models (K-Means, K-Medoids) you can simply use the "Apply Model" operator, exactly as you would with a supervised model. You will find a simple example below.

Cheers,
Ingo


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="224" width="614">
      <operator activated="true" class="generate_data" compatibility="5.1.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="120">
        <parameter key="target_function" value="gaussian mixture clusters"/>
        <parameter key="number_examples" value="1000"/>
        <parameter key="number_of_attributes" value="2"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="5.1.008" expanded="true" height="94" name="Split Data" width="90" x="179" y="120">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.008" expanded="true" height="76" name="Clustering" width="90" x="313" y="30"/>
      <operator activated="true" class="apply_model" compatibility="5.1.008" expanded="true" height="76" name="Apply Model" width="90" x="447" y="120">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Clustering" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
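
To make the idea behind the process above concrete outside RapidMiner: applying a centroid-based cluster model to new data presumably amounts to assigning each new example to its nearest centroid. The following is only a minimal Python sketch of that idea, not RapidMiner's implementation; the centroid values, the sample points, and the function name `assign_to_nearest_centroid` are all made up for illustration.

```python
import math

def assign_to_nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids, as if learned by k-means on the training partition.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]

# New examples that were not part of the clustering run.
new_points = [(0.5, -0.2), (4.1, 6.0), (9.0, 1.0)]

# Each new example gets the index of its nearest centroid as its cluster.
assignments = [assign_to_nearest_centroid(p, centroids) for p in new_points]
print(assignments)  # [0, 1, 2]
```

This works without any reference to the original examples, which is why a centroid-based model can be applied to unseen data in the same way as a supervised model.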

pep

Thank you. Makes sense.
For other schemes such as DBSCAN, Apply Model works too, but it asks for the id attribute, which it seems to match against the ids of the examples in the originally clustered dataset in order to retrieve the cluster. That makes little sense for genuinely new examples, so in this case it is sounder to cluster the original dataset, train a 1- or 3-nearest-neighbour learner with the cluster attribute as the label, and then apply that model to the second dataset so that its examples are placed in clusters via classification.
Cheers.
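
A rough Python sketch of the workaround described above, assuming the first dataset has already been clustered and each example carries its cluster as a label. The data, the labels, and the helper name `knn_cluster` are hypothetical and only illustrate the technique (k-nearest-neighbour classification with the cluster attribute as label); this is not a RapidMiner process.

```python
import math
from collections import Counter

def knn_cluster(point, clustered, k=3):
    """Predict a cluster for `point` by majority vote among its k nearest clustered examples."""
    neighbours = sorted(clustered, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical result of clustering the original dataset: (attributes, cluster label).
clustered = [
    ((0.0, 0.0), "cluster_0"), ((0.2, 0.1), "cluster_0"), ((0.1, 0.3), "cluster_0"),
    ((5.0, 5.0), "cluster_1"), ((5.2, 4.9), "cluster_1"), ((4.8, 5.1), "cluster_1"),
]

# Examples from the second dataset, placed in clusters via classification.
new_points = [(0.1, 0.1), (5.1, 5.0)]
predicted = [knn_cluster(p, clustered, k=3) for p in new_points]
print(predicted)  # ['cluster_0', 'cluster_1']
```

With k=1 each new example simply inherits the cluster of its single closest original example; a larger odd k such as 3 makes the assignment less sensitive to isolated points.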

IngoRM

Yep, I totally agree. For clustering schemes such as DBSCAN, agglomerative clustering, and others, it is probably much better to learn a supervised model from the clustered data and apply that model to the new data.

Cheers,
Ingo