"Text Mining - Document Similarity/Clustering"

rahi84
rahi84 New Altair Community Member
edited November 5 in Community Q&A
Hello All,

I am trying to perform document similarity/clustering in RapidMiner on a survey text field and having problems so far. The data is saved in an Excel file (.xlsx) and I need to process the documents so that the case is lowered, words are tokenized, stemmed and the stopwords filtered out. Could you please run me through the nodes that I need to assign to the data so that I can perform a document similarity and clustering. I have watched 'el chief' tutorials on YouTube and unfortunately it hasn't worked out. I have tried the following nodes (in order) and I get a blank output:

1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords( English), Transform Cases, Stem (Porter))
4. Data Similarity

Best Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi,

    is your text attribute of type text or nominal? You need to use text in order to use data to document. Further i would recommend to use cross distances instead of data to similarity.

    Attached is a sample process.

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="false" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="6.4.000" expanded="true" height="76" name="Get Data" width="90" x="45" y="120">
            <process expanded="true">
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="179" y="75">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="179" y="165">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="94" name="Append" width="90" x="313" y="75"/>
              <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
              <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Simply generate some test data</description>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="120"/>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="120">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="447" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="648" y="120"/>
          <operator activated="true" class="cross_distances" compatibility="6.4.000" expanded="true" height="94" name="Cross Distances" width="90" x="782" y="120">
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Get Data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
          <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • rahi84
    rahi84 New Altair Community Member
    Answer ✓
    Thank you I've solved this. This issue was that the data was not in the type text. The Nominal to Text node helped that.

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    This sounds pretty reasonable.

    Could you post the XML of your process? Then i could check way easier for the mistake.

    Cheers,
    Martin
  • rahi84
    rahi84 New Altair Community Member
    Hi Martin,

    I have 'blacked out' the directory for privacy.

    Please see below the XML code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="112" y="120">
            <parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="246" y="120">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="380" y="255">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="6.4.000" expanded="true" height="76" name="Data to Similarity" width="90" x="581" y="255">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi,

    is your text attribute of type text or nominal? You need to use text in order to use data to document. Further i would recommend to use cross distances instead of data to similarity.

    Attached is a sample process.

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="false" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="6.4.000" expanded="true" height="76" name="Get Data" width="90" x="45" y="120">
            <process expanded="true">
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="179" y="75">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="179" y="165">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="94" name="Append" width="90" x="313" y="75"/>
              <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
              <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Simply generate some test data</description>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="120"/>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="120">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="447" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="648" y="120"/>
          <operator activated="true" class="cross_distances" compatibility="6.4.000" expanded="true" height="94" name="Cross Distances" width="90" x="782" y="120">
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Get Data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
          <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • rahi84
    rahi84 New Altair Community Member
    Answer ✓
    Thank you I've solved this. This issue was that the data was not in the type text. The Nominal to Text node helped that.
  • mehak
    mehak New Altair Community Member

    hello... Please help me how to cluster similar meaning words in a document. please help me. its really urgent.

  • Telcontar120
    Telcontar120 New Altair Community Member

    There are a number of different ways that you might approach that, but if you have a relatively short list of synonymous words/tokens, then you can use the "Replace Token" operator inside the "Process Documents" operator.  It allows you to map a set of related tokens to a single token that represents the set.  You can create as many entries as you want.

     

    If you need something more complicated, there is a synonym finding operator from the Wordnet extension which is available for free in the RapidMiner marketplace.

  • mehak
    mehak New Altair Community Member

    thank you so much for your response. can you please tell me how to make cluster of all of them?