How to reuse preprocessing results across a range of k-means clustering runs

albertoarenal
albertoarenal New Altair Community Member
edited November 2024 in Community Q&A

Hi all,

I am running a k-means clustering analysis on several groups of documents, and I would like to evaluate the clustering performance for different values of k (k = 4 to 20) by comparing their respective Davies-Bouldin indexes.

Before the clustering algorithm, I apply a set of preprocessing tasks (transform cases, tokenize, filter stopwords, stemming, ...) that produce a TF-IDF vector. The output of these preprocessing tasks is always the same for each group of texts (a general view of the process is attached).

Right now I run the whole process for each value of k. I would like to avoid repeating the preprocessing, which is the same for each group of texts, every time I run the clustering and calculate the Davies-Bouldin index, mainly to save a lot of time.

Thank you very much in advance

Alberto

Best Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    Just add a loop after the preprocessing steps that runs k-means and saves the output you want, then cycle through the k values you would like using a loop macro.

    An alternative would be to Store the results after preprocessing and then create a separate process that starts by Retrieving that dataset before each run of the clustering (also within a loop). Either approach should work.
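
    The Store/Retrieve alternative might look roughly like the fragment below. This is an untested sketch rather than a complete process: it assumes the standard `store` and `retrieve` operators with their `repository_entry` parameter, and the repository path and the operator names (`Store TF-IDF`, `Retrieve TF-IDF`) are hypothetical placeholders. Layout attributes and connections are omitted.

    ```xml
    <!-- Process A (run once): place this after the preprocessing operators and
         connect the preprocessing output (the TF-IDF example set) to its "input" port. -->
    <operator activated="true" class="store" compatibility="7.5.003" expanded="true" name="Store TF-IDF">
      <!-- Hypothetical repository location; use any entry in your own repository. -->
      <parameter key="repository_entry" value="//Local Repository/data/tfidf_vectors"/>
    </operator>

    <!-- Process B (run for each experiment): start from the stored result
         instead of repeating the preprocessing. -->
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" name="Retrieve TF-IDF">
      <parameter key="repository_entry" value="//Local Repository/data/tfidf_vectors"/>
    </operator>
    ```

    With this split, only Process B needs to run again when the clustering parameters change.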

  • nmahesh
    nmahesh New Altair Community Member
    Answer ✓

    Hi Alberto,

    Have you tried using the Store operator on the preprocessing output? I would then create separate processes to try out different parameter settings for your clustering and performance evaluation.

    Best,

    Nithin Mahesh

Answers

  • albertoarenal
    albertoarenal New Altair Community Member

    Thank you Brian,

    I'm a beginner with RapidMiner and had not considered the option of storing/retrieving the output of the preprocessing tasks. It is a very good option and I'm sure it will save me a lot of time.

    I don't want to take up much of your time. I had already considered using a loop to try different values of k, but I have not found the right way to implement it. Could you provide an example? I tried the cluster loop operator between the Retrieve operator and the clustering operator, but I don't know how to change k.

    Thanks again
    alberto

  • albertoarenal
    albertoarenal New Altair Community Member

    Thank you Nithin. Both Brian's and your suggestions about storing/retrieving the output of the preprocessing tasks have been very useful.

    Alberto

  • Telcontar120
    Telcontar120 New Altair Community Member

    Sure, here's a sample process with k-means clustering and the Loop Parameters operator.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.5.003" expanded="true" height="68" name="Retrieve Sonar" width="90" x="45" y="85">
    <parameter key="repository_entry" value="//Samples/data/Sonar"/>
    </operator>
    <operator activated="true" class="loop_parameters" compatibility="7.5.003" expanded="true" height="103" name="Loop Parameters" width="90" x="246" y="85">
    <list key="parameters">
    <parameter key="Clustering.k" value="[2.0;10;8;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" height="82" name="Clustering" width="90" x="313" y="85">
    <parameter key="k" value="10"/>
    </operator>
    <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
    <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve Sonar" from_port="output" to_op="Loop Parameters" to_port="input 1"/>
    <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
    <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
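
    To get the Davies-Bouldin index mentioned in the original question, the loop in the sample above could be extended with a performance operator. The fragment below is an untested sketch of a replacement for the Loop Parameters operator. It assumes the Cluster Distance Performance operator (class `cluster_distance_performance`, parameter `main_criterion`) is available in your installation, changes the range to k = 4 to 20, and uses the illustrative operator name `Performance`. Layout attributes are omitted.

    ```xml
    <operator activated="true" class="loop_parameters" compatibility="7.5.003" expanded="true" name="Loop Parameters">
      <list key="parameters">
        <!-- min 4, max 20, 16 steps: k = 4, 5, ..., 20 -->
        <parameter key="Clustering.k" value="[4;20;16;linear]"/>
      </list>
      <process expanded="true">
        <operator activated="true" class="k_means" compatibility="7.5.003" expanded="true" name="Clustering">
          <parameter key="k" value="4"/>
        </operator>
        <!-- Assumed operator class and parameter value; check them against your installation. -->
        <operator activated="true" class="cluster_distance_performance" compatibility="7.5.003" expanded="true" name="Performance">
          <parameter key="main_criterion" value="Davies Bouldin"/>
        </operator>
        <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
        <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
        <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
        <connect from_op="Performance" from_port="performance" to_port="performance"/>
      </process>
    </operator>
    ```

    Recording one value per k (for example with a Log operator inside the loop) is left out of this sketch.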

  • albertoarenal
    albertoarenal New Altair Community Member

    Thank you Telcontar120, I will try this. It is very useful, and I really appreciate your help!
