"K-means: Finding optimum # of clusters with Davies-Bouldin index"

dynera
dynera New Altair Community Member
edited November 5 in Community Q&A
I am clustering text from a discussion forum using k-means.  I have followed the sample process called "09_KMeansWithPlot" (thanks Ingo!) to determine the optimum number of clusters via the following measures: (W) Avg Within Cluster Distance and (DB) Davies-Bouldin Index.

My understanding is that the DB index "is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between cluster (i.e. intercluster) scatter. A good value for the number of clusters is associated to lower values of this index."
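
To make the quoted definition concrete, here is a minimal pure-Python sketch of the Davies-Bouldin index. The function names and the toy data are illustrative assumptions, not RapidMiner's implementation. Clusters are built only from labels that actually occur, so an empty cluster never enters the calculation, and a ratio is treated as infinite when two centroids coincide.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def davies_bouldin(points, labels):
    """Davies-Bouldin index over the non-empty clusters found in `labels`."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")

    cents = {lab: centroid(ps) for lab, ps in clusters.items()}
    # Within-cluster scatter: mean distance of members to their centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }

    total = 0.0
    for i in clusters:
        worst = 0.0
        for j in clusters:
            if i == j:
                continue
            separation = math.dist(cents[i], cents[j])
            # Coinciding centroids make the ratio unbounded.
            ratio = math.inf if separation == 0 else (scatter[i] + scatter[j]) / separation
            worst = max(worst, ratio)
        total += worst
    return total / len(clusters)

# Toy data (hypothetical): two tight, well separated clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
db = davies_bouldin(points, labels)
print(db)
```

Lower values indicate compact clusters that are far apart, which matches the quoted rule that a good number of clusters is associated with lower values of the index.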

That being said I am having trouble interpreting my results...
  • Why are some of my DB values negative infinity?
  • Some of my DB graphs have a gentle negative slope. How do I know where the optimum number of clusters is when there appears to be no "elbow" in the trend line?
  • Why do some of the charts only plot certain numbers of clusters? For example, the x-axis shows 2, 12, 22, etc. instead of every value 1, 2, 3, ..., 22, etc.
  • Are there any rules of thumb I should keep in mind when using the DB index against text data?
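
For the first two questions, one possible way to read a logged DB curve is sketched below: drop the non-finite entries and take the k with the lowest remaining value. The function name `best_k` and the log rows are hypothetical. The sketch assumes lower is better, as in the definition quoted above; if the tool reports negated values, the sign would have to be flipped first.

```python
import math

def best_k(log_rows):
    """Return the (k, db) pair with the lowest finite DB value, or None if no value is finite."""
    finite = [(k, db) for k, db in log_rows if math.isfinite(db)]
    if not finite:
        return None
    return min(finite, key=lambda row: row[1])

# Hypothetical log rows (k, DB); not taken from the thread.
rows = [(2, 1.9), (12, 1.4), (22, float("-inf")), (31, 1.1), (41, 1.2)]
print(best_k(rows))
```

When the curve is nearly flat, the minimum may differ only marginally from its neighbours, so it is worth checking how stable the choice is across runs before trusting it.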
Thanks Rapidminers!

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    just a quick guess from your description: is it possible that some of your clusters are empty? That could explain why you have infinity values, and also why some clusters are not shown.
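
The guess above can be checked outside RapidMiner with a few lines of Python, assuming the cluster assignments have been exported as a list of integer ids in the range 0 to k-1. The function name `empty_clusters` and the sample labels are hypothetical.

```python
from collections import Counter

def empty_clusters(labels, k):
    """Return the cluster ids in 0..k-1 that have no assigned examples."""
    counts = Counter(labels)
    return [c for c in range(k) if counts[c] == 0]

# Hypothetical assignments for k = 5; clusters 1 and 4 receive no examples.
labels = [0, 0, 2, 2, 2, 3]
print(empty_clusters(labels, 5))
```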

    Best regards,
    Marius
  • dynera
    dynera New Altair Community Member
    Hi Marius,

    Thanks for the suggestion.  I read one of your previous posts (Avoid empty clusters in Cluster Model) to see if I can identify whether I have any empty clusters.  As in the previous post, I too am generating prototypes by looping through the k-means parameters.

    I can't seem to insert the operators used in your prior post ("Declare Missing Value" and "Filter") correctly to see if I have any empty clusters. I have attached a copy of my process for you to look at.

    What is the accepted approach when dealing with empty clusters?  Simply remove them?  By removing empty clusters should I expect to see a complete DB graph (the reason for developing this RapidMiner process in the first place)?

    Thanks for the advice Marius!

    Paul


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.005" expanded="true" height="60" name="Retrieve CA_Clarity_Open_Workbench_Combined" width="90" x="45" y="75">
            <parameter key="repository_entry" value="../Data/CA_Clarity_Open_Workbench_Combined"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="75">
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="9999"/>
            <parameter key="prune_above_rank" value="0.05"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.000" expanded="true" height="60" name="Extract Content" width="90" x="112" y="30">
                <parameter key="minimum_text_block_length" value="3"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="120"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="210"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="112" y="300"/>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="246" y="300">
                <parameter key="file" value="C:\Users\kitpa01\Documents\CA\Transformation Team\Text_Mining\Message_Board\Individual_Clarity_Discussion_Topics\XOG_GEL_WSDL_Filter_Dictionary.txt"/>
              </operator>
              <operator activated="true" class="text:replace_tokens" compatibility="5.3.000" expanded="true" height="60" name="Replace Tokens" width="90" x="380" y="300">
                <list key="replace_dictionary">
                  <parameter key="cluster" value="cluster1"/>
                </list>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="514" y="300">
                <parameter key="min_chars" value="2"/>
              </operator>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Replace Tokens" to_port="document"/>
              <connect from_op="Replace Tokens" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_parameters" compatibility="5.3.005" expanded="true" height="112" name="Loop Parameters" width="90" x="447" y="75">
            <list key="parameters">
              <parameter key="Clustering.k" value="[2.0;100.0;10;linear]"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="k_means" compatibility="5.3.005" expanded="true" height="76" name="Clustering" width="90" x="179" y="120">
                <parameter key="k" value="100"/>
              </operator>
              <operator activated="true" class="cluster_distance_performance" compatibility="5.3.005" expanded="true" height="94" name="Performance" width="90" x="380" y="120"/>
              <operator activated="true" class="log" compatibility="5.3.005" expanded="true" height="94" name="Log" width="90" x="514" y="30">
                <list key="log">
                  <parameter key="k" value="operator.Clustering.parameter.k"/>
                  <parameter key="(DB) Davies-Bouldin Index" value="operator.Performance.value.DaviesBouldin"/>
                  <parameter key="(W) Avg Within Cluster Distance" value="operator.Performance.value.avg_within_distance"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
              <connect from_op="Clustering" from_port="clustered set" to_op="Performance" to_port="example set"/>
              <connect from_op="Performance" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Performance" from_port="example set" to_op="Log" to_port="through 2"/>
              <connect from_op="Log" from_port="through 1" to_port="performance"/>
              <connect from_op="Log" from_port="through 2" to_port="result 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_result 1" spacing="0"/>
              <portSpacing port="sink_result 2" spacing="0"/>
              <portSpacing port="sink_result 3" spacing="0"/>
              <portSpacing port="sink_result 4" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve CA_Clarity_Open_Workbench_Combined" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Loop Parameters" to_port="input 1"/>
          <connect from_op="Loop Parameters" from_port="result 1" to_port="result 1"/>
          <connect from_op="Loop Parameters" from_port="result 2" to_port="result 2"/>
          <connect from_op="Loop Parameters" from_port="result 3" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
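
A note on the `Clustering.k` grid in the process above. The sketch below assumes that `[2.0;100.0;10;linear]` means 10 equal steps from 2 to 100 with each value rounded to an integer. That reading is consistent with the 2, 12, 22 ticks mentioned in the original question, but it is not confirmed anywhere in this thread. The function name `linear_grid` is hypothetical.

```python
def linear_grid(start, end, steps):
    """Values visited by a linear grid with `steps` equal steps, rounded to integers."""
    return [round(start + (end - start) * i / steps) for i in range(steps + 1)]

# Grid as configured in the process: 10 steps between 2 and 100.
print(linear_grid(2, 100, 10))

# Under the same assumption, 98 steps would visit every integer k from 2 to 100.
print(linear_grid(2, 100, 98)[:5])
```

If that assumption holds, the gaps on the x-axis come from the step count of the parameter loop rather than from missing clusters.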


  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    when referring to another post, the easiest way to make it retrievable by others is to post a link to the topic.
    I can reproduce the problem, but for reference to the topic you mention please post a link here.

    Btw, if you are dealing with natural language (e.g. English), you should consider adding a Stemming operator to your document processing.

    Regarding the Davies-Bouldin index, I created an internal ticket requesting a discussion of how to deal with empty clusters. Until that is fixed, you have to work around it as described in the other topic, which I currently can't find.

    Best regards,
    Marius
  • dynera
    dynera New Altair Community Member
    Hi Marius,

    My apologies for not including the link I was referring to in the previous post: http://rapid-i.com/rapidforum/index.php/topic,5689.msg20111.html#msg20111

    ...And thanks for submitting the internal ticket!

    I took your advice and added the stemming operator to my process but I still end up with empty clusters.

    Can you recommend any other clustering operators that deal with the empty cluster issue that won't throw off a Davies-Bouldin plot (or similar type of plot for selecting an optimum number of clusters)?

    Thanks again, Marius!
  • MariusHelf
    MariusHelf New Altair Community Member
    Well, the stemming is not supposed to solve the empty clusters problem, but is rather a general improvement of text preprocessing :)

    Concerning the empty clusters, the solution provided in the other thread does not work in your case, since you are not interested in the prototypes themselves but want to calculate the performance with the Performance operator. I am not sure whether any of the other clustering implementations in RapidMiner can guarantee non-empty clusters; just give them a try. They are found in the same operator group as k-Means.

    Best regards,
    Marius
  • dynera
    dynera New Altair Community Member
    Thanks for the help Marius.

    It turns out that k-Means won't help me determine the optimum number of clusters because of the empty clusters it produces...which is still useful information.

    Do you know of a way/process to optimize the number of clusters with DBSCAN using the "epsilon" and "min points" parameters?  I could not find a looping operator that I can use with DBSCAN the way you can with k-Means.

    Paul
  • Hello

    You might find these posts interesting.

    http://rapidminernotes.blogspot.com/search/label/ClusterValidity

    The first one uses DBSCAN, and the one labelled IV uses a clustering result as a classifier.

    regards

    Andrew