Question about stopword list and word stemming (German)

thomas_wiedmann
thomas_wiedmann New Altair Community Member
edited November 5 in Community Q&A

This is my first attempt at using the Filter Stopwords (German) and Stem (German) operators, and I am trying to understand what is going on. I fed in some German text, but the output looks nearly the same as the input (only lowercased). So I have some questions:

 

Input:

Dies ist ein Text mit einigen Worten und einem Punkt. Gestern bin ich gegangen, morgen werde ich gehen.

Output:

dies ist ein text mit einigen worten und einem punkt. gestern bin ich gegangen, morgen werde ich gehen. 

 

a) Is there a list of which words are filtered by the stopword filter operator?

b) What does Stem (German) do?

 

RapidMiner.JPG

 

Process

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="136">
<parameter key="text" value="Dies ist ein Text mit einigen Worten und einem Punkt. Gestern bin ich gegangen, morgen werde ich gehen."/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="380" y="136">
<parameter key="stop_word_list" value="Standard"/>
</operator>
<operator activated="true" class="text:stem_german" compatibility="8.1.000" expanded="true" height="68" name="Stem (German)" width="90" x="581" y="136"/>
<connect from_op="Create Document" from_port="output" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (German)" to_port="document"/>
<connect from_op="Stem (German)" from_port="document" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

 

Thanks!

Thomas

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    I believe you need to use these operators inside a "Process Documents" operator, where you tokenize first so that they have discrete word tokens to operate on. Currently these operators have no real effect because they are trying to operate on the entire document text at once, which is not possible for either stemming or stopword removal.
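
To illustrate the point about tokenizing, here is a minimal Python sketch. It is not RapidMiner's implementation. The `STOPWORDS` set is a hypothetical mini list (the operator's real "Standard" list is not shown in this thread), and `filter_stopwords` is a name made up for this example. It only shows why an exact-match stopword filter cannot remove anything from an untokenized document.

```python
import re

# Hypothetical mini stopword list, for illustration only.
# The real "Standard" list of the operator is not shown in this thread.
STOPWORDS = {"dies", "ist", "ein", "mit", "und", "ich"}


def filter_stopwords(tokens):
    """Drop every token that matches a stopword exactly (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


text = "Dies ist ein Text mit einigen Worten und einem Punkt."

# Without tokenizing, the whole document is one single "token".
# It never equals any stopword, so it passes through unchanged.
untokenized = filter_stopwords([text])

# After splitting on non-letters, each word is compared on its own.
tokens = [t for t in re.split(r"[^A-Za-zÄÖÜäöüß]+", text) if t]
tokenized = filter_stopwords(tokens)

print(untokenized)
print(tokenized)
```

With this toy list, the untokenized call returns the full sentence untouched, while the tokenized call drops the listed words and keeps the rest.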

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @Telcontar120 beat me to it! Here's a sample process, @thomas_wiedmann:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="136">
    <parameter key="text" value="Dies ist ein Text mit einigen Worten und einem Punkt. Gestern bin ich gegangen, morgen werde ich gehen."/>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="380" y="136">
    <parameter key="keep_text" value="true"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85"/>
    <operator activated="true" class="text:filter_stopwords_german" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="246" y="85"/>
    <operator activated="true" class="text:stem_german" compatibility="8.1.000" expanded="true" height="68" name="Stem (German)" width="90" x="447" y="85"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
    <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (German)" to_port="document"/>
    <connect from_op="Stem (German)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • thomas_wiedmann
    thomas_wiedmann New Altair Community Member

    OK, I tried this one...

     

    RapidMiner4.JPG

     

    Result:

    tex wor punk gegang 

    Uuuh. True, but first I have to meditate on this result...   ;-)

    I still have a lot to learn...

     

    Thanks!

    Thomas
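
For readers puzzling over the same result, here is a rough Python sketch of suffix stripping, the general idea behind stemming. It assumes that the four stems come from "Text", "Worten", "Punkt" and "gegangen", with the other words removed by the stopword filter. The two rules in `toy_stem` are hypothetical, chosen only because they reproduce these four stems. The actual rule set of Stem (German) is not described in this thread and is almost certainly more elaborate.

```python
def toy_stem(token):
    """Lowercase, then strip a trailing 'en' and a trailing 't'.

    Illustrative rules only, NOT the algorithm used by Stem (German).
    A suffix is removed only if at least three characters remain.
    """
    stem = token.lower()
    for suffix in ("en", "t"):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
    return stem


# Assumed survivors of the stopword filter (inferred from the result above).
tokens = ["Text", "Worten", "Punkt", "gegangen"]
stems = [toy_stem(t) for t in tokens]

print(" ".join(stems))
```

The point of the sketch is that a stem does not have to be a real word. It only has to be the same string for all inflected forms of a word, so that they are counted together.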