[SOLVED] How to prevent some words from being stemmed?

yakito New Altair Community Member
edited November 5 in Community Q&A
Hello,

I am having some problems with the Stem operator. It seems to stem brand names that contain a dictionary word. As a result, my text classification model is less accurate than I think it would be if the brand names were kept as they are.

Is there any way to apply the Stem operator to all words except a few?

Thanks a lot for any tip in the right direction. I am pretty new to RapidMiner, so excuse me if this is a basic question.

Thanks again.

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    you could replace the brand names before stemming with placeholders that will not get stemmed, and rename them back afterwards. Have a look at the process below, which shows how to prevent a brand name called "testing" from being stemmed.

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
        <process expanded="true" height="639" width="757">
          <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
            <parameter key="text" value="this is a cool test text about testing texts."/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="234" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.2.001" expanded="true" height="60" name="Replace Tokens" width="90" x="380" y="30">
            <list key="replace_dictionary">
              <parameter key="^testing$" value="_testing_brand_"/>
            </list>
          </operator>
          <operator activated="true" class="text:stem_porter" compatibility="5.2.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="514" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.2.001" expanded="true" height="60" name="Replace Tokens (2)" width="90" x="648" y="30">
            <list key="replace_dictionary">
              <parameter key="^_testing_brand_$" value="testing"/>
            </list>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
          <connect from_op="Replace Tokens (2)" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • yakito
    yakito New Altair Community Member
    Perfect, thanks a lot! Why didn't I think of that :p
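
For readers who want to see the placeholder trick from the accepted answer outside RapidMiner, here is a minimal Python sketch of the same three steps: mask the protected token, stem, then unmask. It is only an illustration under stated assumptions: `toy_stem` is a made-up suffix stripper standing in for a real stemmer (it is not the Porter algorithm), and the `PROTECTED` set and all function names are hypothetical. Only the placeholder format `_testing_brand_` is taken from the process above.

```python
import re

# Hypothetical list of protected brand names; the thread only uses "testing".
PROTECTED = {"testing"}


def toy_stem(token):
    """Toy suffix stripper used as a stand-in for a real stemmer.

    This is NOT the Porter algorithm. It only strips "ing" or "s" when at
    least four characters would remain.
    """
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def mask(token):
    """Wrap a protected token in a placeholder the stemmer leaves alone."""
    return "_" + token + "_brand_" if token in PROTECTED else token


def unmask(token):
    """Turn a placeholder back into the original brand name."""
    match = re.fullmatch(r"_(.+)_brand_", token)
    return match.group(1) if match else token


def stem_except_protected(text):
    """Tokenize on non-letters, then mask, stem, and unmask each token."""
    tokens = re.findall(r"[A-Za-z]+", text)
    masked = [mask(t) for t in tokens]
    stemmed = [toy_stem(t) for t in masked]
    return [unmask(t) for t in stemmed]


if __name__ == "__main__":
    print(stem_except_protected("this is a cool test text about testing texts."))
    # ['this', 'is', 'a', 'cool', 'test', 'text', 'about', 'testing', 'text']
```

The placeholder survives because it ends in an underscore, so no suffix rule matches it. With this toy stemmer, "texts" is reduced to "text" while the brand "testing" comes out unchanged, whereas `toy_stem("testing")` on its own would return "test".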