Problem with Filter Stopwords operator

Mohamad1367 New Altair Community Member
edited November 5 in Community Q&A
Hi. I am working on a sentiment analysis project in the Persian language and installed the Rosette extension for some text preprocessing tasks in this language, such as tokenization.
I have a problem with the Filter Stopwords (Dictionary) operator: when I apply it to my data set (after tokenization), I receive only the tokenized data set, with no stop words filtered out. What is the cause of this problem?

Answers

  • Telcontar120 New Altair Community Member
    For this to work, you need to supply a dictionary file to the 2nd input port of the "Filter Stopwords (Dictionary)" operator.  The way it works is that it screens out any words that appear in the dictionary file.  Since you are not supplying it with any dictionary file, it is not filtering anything.
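    Just to illustrate the format (this is not your actual file, and these words are only a handful of common Persian examples), the dictionary is a plain-text file with one stopword per line, something like:

        و
        در
        به
        از
        که
        این

    Any token that matches a line in this file should be removed (subject to the case sensitive setting); everything else passes through unchanged.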

  • Mohamad1367 New Altair Community Member
    edited June 2020
    Thank you for your answer @Telcontar120
    Is it possible to share the example process with me so that I can understand it better? Thank you very much.

  • Mohamad1367 New Altair Community Member
    @Telcontar120 I connected the Open File operator to the file input of the Filter Stopwords operator and attached the stop word dictionary to it, but it didn't work. Is this what you meant in your previous comment?

  • Telcontar120 New Altair Community Member
    Sorry, I don't read Persian, so I am not able to make much of the data files.  But yes, you should be able to do this with the Open File operator.  You can also just specify the file directly in the parameters of the "Filter Stopwords (Dictionary)" operator, where there is a place to set the path to the file you want to use; a rough sketch of that parameter route is included after the attached process.
    A simplified process that works is attached; you would just need to swap in your own file paths and names.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
            <parameter key="file" value="C:\Users\brian\Google Drive\RapidMiner\Training Text Mining\SourceData\Room Service Reviews\food_swissotel_chicago.2.gold.txt"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="txt"/>
            <parameter key="encoding" value="SYSTEM"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="246" y="85">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="open_file" compatibility="9.6.000" expanded="true" height="68" name="Open File" width="90" x="112" y="136">
                <parameter key="resource_type" value="file"/>
                <parameter key="filename" value="C:\Users\brian\Downloads\stopwords.txt"/>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="34">
                <parameter key="case_sensitive" value="false"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Open File" from_port="file" to_op="Filter Stopwords (Dictionary)" to_port="file"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
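
    If you would rather not use Open File at all, the "Filter Stopwords (Dictionary)" block in the process XML would look roughly like this instead, with the Open File operator and its connection removed (the file parameter and path below are placeholders, not anything from your process, and for Persian text you probably want the encoding set to UTF-8 rather than SYSTEM):

    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="9.3.001" expanded="true" name="Filter Stopwords (Dictionary)">
      <!-- placeholder path: point this at your own Persian stopword list -->
      <parameter key="file" value="C:\path\to\persian_stopwords.txt"/>
      <parameter key="case_sensitive" value="false"/>
      <parameter key="encoding" value="UTF-8"/>
    </operator>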
    
    This should be enough to get you started.  You can of course do more with the processing of the documents if you desire.