"KMeans-Clusterting"

Tammi
edited November 5 in Community Q&A
Good morning,

I'm using the KMeans cluster model to group similar measurement values into clusters.

The process is quite similar to step 4 of the Help tutorial, the KMeans model with the Iris data.

I have 10 different attributes, but it seems that the KMeans method groups the measurement values
mainly according to the first 3 attributes.

Each attribute should have the same weight in this method, right?


Maybe you have had the same experience, or could give me some advice.

Are there any settings I should change in the process?


Thanks.

Answers

  • haddock
    Hi Tammi,

    Good morning and welcome to RapidMiner! In general, you will get more useful answers if you post the XML of your process. Bear in mind that this is open source, so you can check the code for yourself. Looking at Kmeans.java does not seem to support your view that
    "the KMeans method groups the measurement values
    mainly according to the first 3 attributes."
    So, could it be the data that produces this mirage for you? I've often suspected errors that turned out to be caused by my own expectations; but I guess we should be doing data mining to let the data speak to us, rather than the other way round!

    Hope you find the answer.

    Have a good weekend.
  • Tammi
    Hi haddock,

    thanks for the fast response.

    I agree, you have to be careful when interpreting the results of a data process, and keep what you expect separate from what the results actually show.

    In my case, the values of the first 3 attributes vary widely. The values of the other attributes are more "similar".

    This could be the reason for the clustering behaviour.

    Anyway, it would be great if the process also formed clusters based on attributes 4 to 10.

    Have a nice weekend.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Root">
        <description>&lt;p&gt;In many cases, no target attribute (label) can be defined and the data should be automatically grouped. This procedure is called &amp;quot;Clustering&amp;quot;. RapidMiner supports a wide range of clustering schemes which can be used in just the same way like any other learning scheme. This includes the combination with all preprocessing operators. &lt;p&gt; &lt;p&gt; In this experiment, the well-known Iris data set is loaded (the label is loaded, too, but it is only used for visualization and comparison and not for building the clusters itself). One of the most simple clustering schemes, namely KMeans, is then applied to this data set. Afterwards, a dimensionality reduction is performed in order to better support the visualization of the data set in two dimensions. &lt;/p&gt;&lt;p&gt; Just perform the process and compare the clustering result with the original label (e.g. in the plot view of the example set). You  can also visualize the cluster model itself. &lt;/p&gt;</description>
        <parameter key="logverbosity" value="warning"/>
        <process expanded="true" height="604" width="981">
          <operator activated="true" class="retrieve" compatibility="5.0.000" expanded="true" height="60" name="Retrieve" width="90" x="44" y="31">
            <parameter key="repository_entry" value="../../data/Iris"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.0.000" expanded="true" height="76" name="KMeans" width="90" x="179" y="30">
            <parameter key="k" value="3"/>
          </operator>
          <operator activated="true" class="singular_value_decomposition" compatibility="5.0.000" expanded="true" height="94" name="SVDReduction" width="90" x="715" y="30"/>
          <connect from_op="Retrieve" from_port="output" to_op="KMeans" to_port="example set"/>
          <connect from_op="KMeans" from_port="cluster model" to_port="result 4"/>
          <connect from_op="KMeans" from_port="clustered set" to_op="SVDReduction" to_port="example set input"/>
          <connect from_op="SVDReduction" from_port="example set output" to_port="result 1"/>
          <connect from_op="SVDReduction" from_port="original" to_port="result 2"/>
          <connect from_op="SVDReduction" from_port="preprocessing model" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="126"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>
  • Hello Tammi,

    An attribute's range affects its influence on the clusters found: k-means assigns each example to the nearest centroid, so an attribute with a wide value range contributes more to that distance than one with a narrow range. In the example that follows, I normalise the first three attributes but not the fourth, and I observe that the clustering improves on the Iris data set. For your problem, it is up to the domain expert to decide whether the ranges should be changed or not.

    regards

    Andrew
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.003" expanded="true" name="Root">
        <description>&lt;p&gt;In many cases, no target attribute (label) can be defined and the data should be automatically grouped. This procedure is called &amp;quot;Clustering&amp;quot;. RapidMiner supports a wide range of clustering schemes which can be used in just the same way like any other learning scheme. This includes the combination with all preprocessing operators. &lt;p&gt; &lt;p&gt; In this experiment, the well-known Iris data set is loaded (the label is loaded, too, but it is only used for visualization and comparison and not for building the clusters itself). One of the most simple clustering schemes, namely KMeans, is then applied to this data set. Afterwards, a dimensionality reduction is performed in order to better support the visualization of the data set in two dimensions. &lt;/p&gt;&lt;p&gt; Just perform the process and compare the clustering result with the original label (e.g. in the plot view of the example set). You  can also visualize the cluster model itself. &lt;/p&gt;</description>
        <parameter key="logverbosity" value="warning"/>
        <process expanded="true" height="604" width="981">
          <operator activated="true" class="retrieve" compatibility="5.1.003" expanded="true" height="60" name="Retrieve" width="90" x="44" y="31">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="normalize" compatibility="5.1.003" expanded="true" height="94" name="Normalize" width="90" x="45" y="120">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="a1"/>
            <parameter key="attributes" value="a1|a2|a3"/>
            <parameter key="method" value="range transformation"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.1.003" expanded="true" height="76" name="KMeans" width="90" x="45" y="255">
            <parameter key="k" value="3"/>
          </operator>
          <operator activated="true" class="map_clustering_on_labels" compatibility="5.1.003" expanded="true" height="76" name="Map Clustering on Labels" width="90" x="45" y="345"/>
          <operator activated="true" class="performance" compatibility="5.1.003" expanded="true" height="76" name="Performance" width="90" x="246" y="255"/>
          <connect from_op="Retrieve" from_port="output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="KMeans" to_port="example set"/>
          <connect from_op="KMeans" from_port="cluster model" to_op="Map Clustering on Labels" to_port="cluster model"/>
          <connect from_op="KMeans" from_port="clustered set" to_op="Map Clustering on Labels" to_port="example set"/>
          <connect from_op="Map Clustering on Labels" from_port="example set" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance" from_port="example set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="126"/>
        </process>
      </operator>
    </process>
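
As a rough illustration of why the value range matters, the following Python sketch computes how much each attribute contributes to the squared Euclidean distance between two examples, before and after a range transformation to [0, 1]. The data and the helper names (`squared_contributions`, `min_max_scale`) are hypothetical and only serve the illustration; this is not RapidMiner's code, and it assumes a plain Euclidean distance measure.

```python
def squared_contributions(p, q):
    """Per-attribute share of the squared Euclidean distance between p and q."""
    sq = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(sq)
    return [s / total for s in sq]


def min_max_scale(rows):
    """Range transformation: rescale every column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Hypothetical data: attribute 0 spans hundreds, attributes 1 and 2 span about 1.
rows = [
    [100.0, 0.1, 0.9],
    [900.0, 0.9, 0.1],
    [500.0, 0.5, 0.5],
]

raw_share = squared_contributions(rows[0], rows[1])
scaled = min_max_scale(rows)
scaled_share = squared_contributions(scaled[0], scaled[1])

print("raw share per attribute:   ", [round(s, 4) for s in raw_share])
print("scaled share per attribute:", [round(s, 4) for s in scaled_share])
```

On the raw values, almost the entire distance comes from attribute 0. After the range transformation, the three attributes contribute equally for this pair of examples.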
  • Tammi
    Hi Andrew,

    Thanks a lot for the hint. In my case the value ranges of the first 2 attributes are very wide compared
    to the ranges of the other attributes.


    Thanks.

    Have a nice day.

    Tammi
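
The effect discussed in this thread can be reproduced on a toy data set. The sketch below is a minimal, deterministic version of Lloyd's k-means algorithm in plain Python, run once on raw values and once on range-transformed values. Everything in it is an assumption made for illustration: the four data rows are invented, the centroids simply start at the first k rows, and the function names are hypothetical. It is not the implementation behind RapidMiner's KMeans operator, and the outcome of a real run can also depend on the initial centroids.

```python
def min_max_scale(rows):
    """Range transformation: rescale every column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(rows, k, max_iter=100):
    """Plain Lloyd's algorithm; centroids start at the first k rows."""
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stable, converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical data: attribute 0 has a wide range; attributes 1 and 2 are
# narrow but agree with each other on a different grouping.
rows = [
    [0.0, 0.0, 0.0],
    [10.0, 1.0, 1.0],
    [1000.0, 0.1, 0.1],
    [1010.0, 0.9, 0.9],
]

raw_labels = k_means(rows, k=2)
scaled_labels = k_means(min_max_scale(rows), k=2)

print("raw labels:   ", raw_labels)
print("scaled labels:", scaled_labels)
```

On the raw values the two clusters follow attribute 0 alone (rows 0 and 1 against rows 2 and 3). After the range transformation, attributes 1 and 2 carry equal weight and the grouping changes to rows 0 and 2 against rows 1 and 3. Whether that rescaling is appropriate for real measurements remains a domain question, as noted above.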