"distance measures between-clusters"

MarcosRL New Altair Community Member
edited November 5 in Community Q&A
Hello friends of the community,
I have a question:
I need to calculate the distance between cluster centroids to determine how far one cluster is from another. Is there an operator for this in RapidMiner?
Regards

Answers

  • marcin_blachnik New Altair Community Member
    Hi

    You can use Extract Cluster Prototypes to get the centroids as an ExampleSet and then apply one of the operators from the Similarity Computation group in the Modeling folder. If you need the actual distance values, you can use, for example, the Data to Similarity operator.

    Best

    Marcin
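To make the idea above concrete outside of RapidMiner, here is a minimal Python sketch of what "similarity between centroids" means: take the centroid table (one row per cluster) and compute a similarity for every pair of rows. The centroid vectors and the helper names (`cosine_similarity`, `pairwise_centroid_similarity`) are made up for illustration. This is not how the RapidMiner operators are implemented, only the underlying arithmetic, assuming cosine similarity as the measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


def pairwise_centroid_similarity(centroids):
    """Return {(name_i, name_j): similarity} for every unordered pair of centroids."""
    names = sorted(centroids)
    result = {}
    for idx, first in enumerate(names):
        for second in names[idx + 1:]:
            result[(first, second)] = cosine_similarity(
                centroids[first], centroids[second]
            )
    return result


# Hypothetical centroid vectors (made-up term weights, not taken from the thread).
centroids = {
    "cluster_0": [1.0, 0.0, 0.0],
    "cluster_1": [0.0, 1.0, 0.0],
    "cluster_2": [1.0, 1.0, 0.0],
}

similarities = pairwise_centroid_similarity(centroids)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

With cosine similarity, a higher value means the two centroids point in a more similar direction, so the pair with the largest value is the "closest" pair of clusters under this measure.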
  • MarcosRL New Altair Community Member
    Hello marcin.blachnik, thanks for your answer.
    I found an example in the forum, but I cannot interpret the results. I am clustering four (4) documents with k-means using k = 2, and I get the following result
    in the output (Cross Distances ExampleSet):
    row  request  document  distance
    1    1.0      1.0       0.012
    2    2.0      1.0       0.012
    3    3.0      1.0       0.012
    4    4.0      1.0       0.012
    5    2.0      2.0       0.016
    6    3.0      2.0       0.016
    7    4.0      2.0       0.016
    8    1.0      2.0       0.016

    How can I get the values of the centroids and the distance between them?
    What does the request column mean, and why do I have four request values if I have only two clusters?

    The XML process is attached:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <parameter key="send_mail" value="always"/>
       <parameter key="notification_email" value="marcoslozina@gmail.com"/>
       <process expanded="true" height="404" width="681">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="75">
           <list key="text_directories">
             <parameter key="doc1" value="C:\Users\marcos\doc1"/>
             <parameter key="doc2" value="C:\Users\marcos\doc2"/>
             <parameter key="doc3" value="C:\Users\marcos\doc3"/>
             <parameter key="doc4" value="C:\Users\marcos\doc4"/>
           </list>
           <parameter key="prune_above_rank" value="0.05"/>
           <process expanded="true" height="415" width="758">
             <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
             <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120"/>
             <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter stopwords_pronombres_preposiciones" width="90" x="45" y="210">
               <parameter key="file" value="C:\Users\marcos\Desktop\stopwords\stopwords_pronombres_preposiciones.txt"/>
             </operator>
             <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
             <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.000" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="120">
               <parameter key="max_length" value="4"/>
             </operator>
             <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="179" y="210">
               <parameter key="condition" value="matches"/>
               <parameter key="regular_expression" value="word1|word2|"/>
             </operator>
             <connect from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Filter stopwords_pronombres_preposiciones" to_port="document"/>
             <connect from_op="Filter stopwords_pronombres_preposiciones" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
             <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
             <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
             <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="179" y="165">
           <parameter key="max_runs" value="100"/>
           <parameter key="determine_good_start_values" value="true"/>
           <parameter key="measure_types" value="NumericalMeasures"/>
           <parameter key="numerical_measure" value="CosineSimilarity"/>
         </operator>
         <operator activated="true" class="extract_prototypes" compatibility="5.2.008" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="342" y="47"/>
         <operator activated="true" class="cross_distances" compatibility="5.2.008" expanded="true" height="94" name="Cross Distances" width="90" x="380" y="165">
           <parameter key="measure_types" value="NumericalMeasures"/>
           <parameter key="numerical_measure" value="CosineSimilarity"/>
           <parameter key="only_top_k" value="true"/>
           <parameter key="k" value="8"/>
           <parameter key="compute_similarities" value="true"/>
         </operator>
         <connect from_op="Process Documents from Files" from_port="example set" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
         <connect from_op="Clustering" from_port="clustered set" to_op="Cross Distances" to_port="request set"/>
         <connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Cross Distances" to_port="reference set"/>
         <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
         <connect from_op="Cross Distances" from_port="request set" to_port="result 2"/>
         <connect from_op="Cross Distances" from_port="reference set" to_port="result 3"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
         <portSpacing port="sink_result 4" spacing="0"/>
       </process>
     </operator>
    </process>

    I need some distance indicator that tells me which clusters are closer to each other.

    thank you very much
    regards
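A note on reading the table in the post above: in the attached process, the clustered set (four documents) is connected to the request port of Cross Distances and the two cluster prototypes to the reference port, so each output row is one (request, reference) pair, which gives 4 × 2 = 8 rows. The Python sketch below mimics that pairing with made-up vectors. The function `cross_similarities` and all data are hypothetical and only illustrate the row count; they do not reproduce the values in the thread. Under the assumption that the operator accepts the same ExampleSet on both ports, passing the centroids on both sides would instead yield centroid-to-centroid pairs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


def cross_similarities(request, reference):
    """Return one (request_id, reference_id, similarity) row per pair."""
    return [
        (req_id, ref_id, cosine_similarity(req_vec, ref_vec))
        for ref_id, ref_vec in reference.items()
        for req_id, req_vec in request.items()
    ]


# Hypothetical document vectors (four documents, two term weights each).
documents = {
    1: [1.0, 0.0],
    2: [0.8, 0.2],
    3: [0.0, 1.0],
    4: [0.2, 0.8],
}

# Hypothetical centroids: the mean of documents 1-2 and of documents 3-4.
centroids = {
    1: [0.9, 0.1],
    2: [0.1, 0.9],
}

# Documents as request, centroids as reference: 4 x 2 = 8 rows.
doc_vs_centroid = cross_similarities(documents, centroids)

# Centroids on both sides: 2 x 2 = 4 rows, including each centroid with itself.
centroid_vs_centroid = cross_similarities(centroids, centroids)

for row in centroid_vs_centroid:
    print(row[0], row[1], round(row[2], 3))
```

In the second result, the rows where request and reference differ carry the between-cluster value; the rows pairing a centroid with itself are trivially 1.0 under cosine similarity.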