Filter Stopwords (English) takes out a non-stopword token

AO1 Altair Community Member
edited November 2024 in Community Q&A

Greetings community,

I am learning to use RapidMiner to extract and analyse occurrences of selected keywords in annual reports prepared by commercial entities. RapidMiner works well for all the keywords I study except one.

For some reason, the Filter Stopwords (English) operator filters out the word 'important' across the whole corpus of documents I study.

For example, I have a document in which a manual search finds the following words of interest:

important - 11
importantly - 4
importance - 4

Using Process Documents from Files with the Filter Stopwords (English) operator enabled, I see only the occurrences of 'importantly' and 'importance'. With the operator disabled, I also get the expected 11 occurrences of 'important'.

I tried changing the tokenization mode from 'non letters' to 'linguistic tokens', but it did not help.

Question: Is this a known error?

(I don't see the </> icon to share my process.)

Kind regards,

Best Answer

  • RolandJones
    Altair Employee
    Answer ✓

    Hi @AO1 ,

    I'm able to replicate this on all the versions available to me. I will see if I can find out more from the development team. In the meantime, I would suggest using Filter Stopwords (Dictionary) for more fine-grained control.

    Best,

    Roland
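
    As a rough illustration of the suggested workaround, the fragment below swaps the Filter Stopwords (English) operator in the posted process for Filter Stopwords (Dictionary). It is a sketch under assumptions, not a verified export: the operator class `text:filter_stopwords_dictionary` and the parameter keys `file`, `case_sensitive` and `encoding` are assumed from the Text Processing extension's naming, the file path is hypothetical, and the file is assumed to be a plain-text list with one stopword per line that simply leaves out 'important'.

    ```xml
    <!-- Sketch only: stands in for the Filter Stopwords (English) operator of the posted process. -->
    <!-- Assumption: class name and parameter keys follow the Text Processing extension's conventions. -->
    <!-- Assumption: custom_stopwords.txt is a hypothetical file, one stopword per line, without "important". -->
    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="10.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
      <parameter key="file" value="C:/path/to/custom_stopwords.txt"/>
      <parameter key="case_sensitive" value="false"/>
      <parameter key="encoding" value="SYSTEM"/>
    </operator>
    <!-- The two connections that previously passed through Filter Stopwords (English) are rewired to the new operator. -->
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
    <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    ```

    Because the stopword list is then an ordinary text file, any other keyword of interest that turns out to be filtered can be kept by deleting its line from the file.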

Answers

  • AO1
    AO1 Altair Community Member
    edited September 2024

    Process and test document added

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="10.4.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="10.4.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="10.0.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="34">
            <list key="text_directories">
              <parameter key="test RM" value="C:/Users/ovsyanna/OneDrive - Lincoln University/My Documents/test for RM/test PDF"/>
            </list>
            <parameter key="file_pattern" value="*"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="Term Frequency"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="246" y="34">
                <parameter key="transform_to" value="lower case"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="34"/>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="10.0.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="648" y="34">
                <parameter key="condition" value="contains"/>
                <parameter key="string" value="importan"/>
                <parameter key="regular_expression" value="(important)"/>
                <parameter key="case_sensitive" value="false"/>
                <parameter key="invert condition" value="false"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>