"Read from Database, Process Documents From Data, kMeans Clustering"

Unknown
edited November 5 in Community Q&A
Greetings - My question concerns something I imagine is very simple and that, as a newbie, I am merely overlooking. However, after reading the manual and similar posts (like this one: http://rapid-i.com/rapidforum/index.php/topic,5518.0.html ), I am still at a loss.

I am reading data from a DB with the following columns:
  • entity_id
  • raw_text
My current process grabs each row, turns it into a doc, processes the doc, then attempts to cluster the docs using the k-means clustering operator. My goal is to have the docs clustered, but show the entity_id value instead of the id value generated by RapidMiner. I have attempted the following with no luck:
  • Add Set Role operator for the attribute entity_id to id, after Process Documents From Data operator
- Doesn't work: for entity_id to show up after the Process Documents From Data operator, it appears I need to check the "Add meta information" box. If I do this, the k-means clustering operator complains about nominal (non-numeric) attributes, specifically title, language, etc. These attributes do not exist in my data and appear to be added by the Process Documents From Data operator.
  • Add Set Role operator for the attribute entity_id to id, before Process Documents From Data operator
- Same issue as above. entity_id doesn't make it through without checking the "Add meta information" box. As a result, the k-means clustering operator complains about the title, language, and robots attributes, which I did not create.


Many thanks in advance for helping me through what I imagine is a total noob oversight.

Answers

  • MariusHelf
    Hm, where does RapidMiner create an id? If you add the id before Process Documents, it survives until the end of the process, even if you add a clustering algorithm at the end. Please have a look at the attached process. If you keep having problems, please post your process setup, as described in the link in my signature.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="583" width="889">
          <operator activated="true" class="generate_nominal_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Nominal Data" width="90" x="112" y="30"/>
          <operator activated="true" class="generate_id" compatibility="5.3.000" expanded="true" height="76" name="Generate ID" width="90" x="246" y="30"/>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.000" expanded="true" height="76" name="Nominal to Text" width="90" x="380" y="30"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.005" expanded="true" height="76" name="Process Documents from Data" width="90" x="514" y="30">
            <list key="specify_weights"/>
            <process expanded="true" height="583" width="889">
              <operator activated="true" class="text:tokenize" compatibility="5.2.005" expanded="true" height="60" name="Tokenize" width="90" x="194" y="40">
                <parameter key="mode" value="specify characters"/>
                <parameter key="characters" value=".: "/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.3.000" expanded="true" height="76" name="Clustering" width="90" x="648" y="30"/>
          <connect from_op="Generate Nominal Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • Hi Marius - Thanks for responding so quickly! In reference to your first question, I believe I am referring to the "id" column that is generated in the Results View. It appears to correspond to the row number from the exampleSet. Also, each of the documents within the clusters is identified with this same id value.

    My data looks like this in the database I am querying. In real life, the values within the raw_text column are significantly longer. Also, I rename my database's id column to entity_id and only return ids over 1000, as well as limit the result to 100 rows.
    id    raw_text
    1003  This entity is about snowboarding and other fun winter sports
    2097  This entity is about orange juice and pancakes
    2318  This entity is about elephants
    The database call works fine. Once data is in, here is my process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="476" width="949">
          <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
            <parameter key="connection" value="Ramsey Test"/>
            <parameter key="query" value="SELECT raw_text, id AS entity_id&#10;FROM ram_entity&#10;WHERE raw_text IS NOT NULL AND id &gt; 1000&#10;LIMIT 100"/>
            <enumeration key="parameters"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="246" y="165">
            <parameter key="name" value="entity_id"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="absolute"/>
            <parameter key="prune_below_absolute" value="2"/>
            <parameter key="prune_above_absolute" value="999"/>
            <list key="specify_weights"/>
            <process expanded="true" height="745" width="1027">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.2.003" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="165" y="28"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="470" y="29"/>
              <operator activated="true" class="text:stem_snowball" compatibility="5.2.004" expanded="true" height="60" name="Stem (Snowball)" width="90" x="607" y="30"/>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
              <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.2.008" expanded="true" height="76" name="Clustering" width="90" x="581" y="30">
            <parameter key="k" value="3"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <connect from_op="Read Database" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Set Role" from_port="original" to_port="result 3"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    The goal is to have k-means cluster the documents by raw_text, then identify each document by its entity_id rather than by the row number (which seems to be named "id" and is always between 1 and 100). For example, in the "Folder View", cluster_0 might expand to 2220, 3862, and 1034 (entity_id values), not 12, 44, and 86 (row numbers, called id).
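
    For reference, one arrangement that mirrors the ordering in Marius's example (id first, then Nominal to Text, then Process Documents from Data) is sketched below for this process. It is an untested sketch, not a confirmed fix: the operator classes and parameter keys are copied from the two processes in this thread, the wiring is hypothetical, and it assumes raw_text arrives from the database as a nominal attribute rather than as text.

    ```xml
    <process expanded="true">
      <!-- Sketch only: Read Database, Process Documents from Data and Clustering
           are assumed to stay exactly as in the process posted above. -->
      <!-- Give entity_id the id role before any text processing happens. -->
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" name="Set Role">
        <parameter key="name" value="entity_id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <!-- Assumption: raw_text is nominal and is converted to text here,
           as Marius's example does with its generated data. -->
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" name="Nominal to Text"/>
      <connect from_op="Read Database" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    </process>
    ```

    With this ordering, "Add meta information" can stay unchecked, as it already is in the process above. Whether entity_id then appears in the clustered set is the thing to check in the Results View.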