Set categories by finding words in a document

vijen New Altair Community Member
edited November 5 in Community Q&A
Hello everyone,

I am new to RapidMiner but enjoying the ride so far. I am stuck with a couple of issues.
First, I have a set of 3 categories, each defined by 5 words, meaning that if a document contains all 5 of those words, I would like to assign that document to that category.
In other words, I would like to go through my dataset, search each document for the 5 words of each category, and associate the document with the category for which it finds all 5 words.
Is there a way to do that in RapidMiner?

Cheers,

D

Answers

  • MariusHelf New Altair Community Member
    Hi,

    You should use the Text Processing extension to tokenize your documents. Set the vector_creation parameter of the Process Documents operator to Binary Term Occurrences. You end up with an example set which contains the documents as rows and the tokens as columns. If the value of a column is 1 in a row, the word appeared in the corresponding document. You can then use Generate Attributes to create a new attribute that checks whether all 5 words of a category are present and writes the result ("yes" or "no") to the new attribute. Have a look at the attached process.
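
    The five-word check described above could look roughly like the fragment below. This is only a sketch and is separate from the attached process: the operator name and the attribute names word1 to word5 are hypothetical placeholders for the five tokens of category 1. It assumes that each of these tokens exists as a column in the example set produced by Process Documents; a token that never occurs in any document may have no column, in which case the expression would need adjusting.

    ```xml
    <!-- Sketch only: word1 ... word5 are hypothetical token names for category 1 -->
    <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" name="Generate Attributes (cat1)">
      <list key="function_descriptions">
        <!-- "yes" only if all five binary term occurrences equal 1 -->
        <parameter key="is_in_cat1" value="if(word1 == 1 &amp;&amp; word2 == 1 &amp;&amp; word3 == 1 &amp;&amp; word4 == 1 &amp;&amp; word5 == 1, &quot;yes&quot;, &quot;no&quot;)"/>
      </list>
    </operator>
    ```

    One such operator (or one entry in the same function_descriptions list) per category would yield one yes/no attribute for each of the 3 categories.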

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
        <process expanded="true" height="431" width="748">
          <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="this is a test text which contains an indicator word."/>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="210">
            <parameter key="text" value="blabla blubb blubb"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.2.001" expanded="true" height="112" name="Process Documents" width="90" x="179" y="30">
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true" height="639" width="757">
              <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes" width="90" x="380" y="30">
            <list key="function_descriptions">
              <parameter key="is_in_cat1" value="if(indicator == 1, &quot;yes&quot;, &quot;no&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="514" y="30">
            <list key="function_descriptions">
              <parameter key="is_in_cat2" value="if(indicator == 1 &amp;&amp; word ==1, &quot;yes&quot;, &quot;no&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.2.002" expanded="true" height="76" name="Generate Attributes (3)" width="90" x="648" y="30">
            <list key="function_descriptions">
              <parameter key="is_in_cat3" value="if(blabla == 1, &quot;yes&quot;, &quot;no&quot;)"/>
            </list>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>