[SOLVED] How to prevent some words from being stemmed?
yakito
New Altair Community Member
Hello,
I am having some problems with the Stem operator. It appears to stem brand names that contain a dictionary word, and as a result my text classification model is less accurate than I think it would be if the brand names were kept intact.
Is there any way to apply the Stem operator to all words except a few?
Thanks a lot for any tip in the right direction. I am pretty new to RapidMiner, so excuse me if this is a basic question.
Thanks again.
Answers
Hi,
you could replace each brand name before stemming with a placeholder that will not get stemmed, and replace it back afterwards. The process below shows how to prevent a brand name called "testing" from being stemmed.
Best,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.002">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
<process expanded="true" height="639" width="757">
<operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
<parameter key="text" value="this is a cool test text about testing texts."/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="234" y="30"/>
<operator activated="true" class="text:replace_tokens" compatibility="5.2.001" expanded="true" height="60" name="Replace Tokens" width="90" x="380" y="30">
<list key="replace_dictionary">
<parameter key="^testing$" value="_testing_brand_"/>
</list>
</operator>
<operator activated="true" class="text:stem_porter" compatibility="5.2.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="514" y="30"/>
<operator activated="true" class="text:replace_tokens" compatibility="5.2.001" expanded="true" height="60" name="Replace Tokens (2)" width="90" x="648" y="30">
<list key="replace_dictionary">
<parameter key="^_testing_brand_$" value="testing"/>
</list>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
<connect from_op="Replace Tokens" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
<connect from_op="Replace Tokens (2)" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
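For readers who want to see the same mask, stem, restore idea outside RapidMiner, here is a minimal Python sketch. It is only an illustration, not part of the process above: `toy_stem` is a hypothetical stand-in that strips a few suffixes (it is not the Porter stemmer), and `stem_except` and the `_<word>_brand_` placeholder format are assumed names chosen to mirror the two Replace Tokens steps.

```python
import re


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least four characters remain, to avoid mangling short words.
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def stem_except(tokens, protected, stem=toy_stem):
    """Stem all tokens except the ones listed in `protected`."""
    # Step 1: mask protected words with placeholders that end in "_",
    # so no suffix rule of the toy stemmer matches them.
    placeholders = {word: "_" + word + "_brand_" for word in protected}
    restore = {ph: word for word, ph in placeholders.items()}
    masked = [placeholders.get(t, t) for t in tokens]
    # Step 2: stem everything; the placeholders pass through unchanged.
    stemmed = [stem(t) for t in masked]
    # Step 3: map the placeholders back to the original brand names.
    return [restore.get(t, t) for t in stemmed]


tokens = re.findall(r"[A-Za-z]+", "this is a cool test text about testing texts.")
result = stem_except(tokens, protected={"testing"})
print(result)
```

With an empty `protected` set the token "testing" is reduced to "test", while listing it keeps the brand name intact and still stems "texts" to "text". In plain code one could also just skip protected tokens instead of masking them; the placeholder round trip is kept here only because it matches the operator chain in the process.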
Perfect, thanks a lot! Why didn't I think of that?