[SOLVED] Filter text from a list of words

johan_CG
johan_CG New Altair Community Member
edited November 2024 in Community Q&A
Hi everybody,

I built a process to search for and count a list of keywords in thousands of files.
I built the keyword list from an Excel file; after several operations it is now an example set with one keyword per example.

I would like to do something like the inverse of "Filter Stopwords (Dictionary)", using the attribute of my example set (or a word list, if someone can explain to me how to convert an example set attribute into a word list).

I found earlier topics on this, but I don't know whether anything has changed since. They suggest using the "Filter Tokens (by Content)" operator with the "matches" condition and the words in the regular expression, but I can't use this solution because I have tens of keyword lists with hundreds of keywords each.

They also mention modifying the source code of the "Filter Stopwords (Dictionary)" operator. Can somebody tell me where I can find the source code of the operator and how to install my own operator in RapidMiner?
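
For the "Filter Tokens (by Content)" approach mentioned above, a long expression does not have to be typed by hand: it can be generated from the keyword list. The sketch below is only an illustration in Python, outside RapidMiner. The function name `build_keyword_pattern` and the sample keywords are hypothetical, and it assumes that "matches" means the whole token must match the expression. Python's escaping rules are not guaranteed to be identical to the regular expression dialect RapidMiner uses, so a generated expression should be checked before being pasted into the operator.

```python
import re

def build_keyword_pattern(keywords):
    """Build one alternation regex from a list of literal keywords (hypothetical helper)."""
    # Escape each keyword so characters such as '.' or '+' are treated literally.
    escaped = [re.escape(k) for k in keywords]
    return re.compile("|".join(escaped))

# Sample data, invented for illustration.
keywords = ["KW_1", "KW_2", "KW_10"]
tokens = ["KW_1", "foo", "KW_10", "KW_100"]

pattern = build_keyword_pattern(keywords)

# fullmatch mimics a whole-token "matches" condition (an assumption about the operator).
matching = [t for t in tokens if pattern.fullmatch(t)]
print(pattern.pattern)
print(matching)
```

With tens of lists and hundreds of keywords each, one expression per list would have to be generated this way, which is why a dictionary-based filter is the more convenient option.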

Thanks in advance
Johan

Answers

  • venkatesh
    venkatesh New Altair Community Member
    Do you mean that you saved the list of keywords as an example set, with each example (row) being one keyword? If so, have a look at the process below to see how to convert an example set into a word list.


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.006" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.006" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="112" y="30">
            <list key="attribute_values">
              <parameter key="word" value="&quot;word_1&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_data_user_specification" compatibility="5.3.006" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="112" y="120">
            <list key="attribute_values">
              <parameter key="word" value="&quot;word_2&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" breakpoints="after" class="union" compatibility="5.3.006" expanded="true" height="76" name="Union" width="90" x="246" y="30"/>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.006" expanded="true" height="76" name="Nominal to Text" width="90" x="380" y="30"/>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.001" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="255">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.001" expanded="true" height="94" name="Process Documents" width="90" x="447" y="210">
            <parameter key="vector_creation" value="Term Frequency"/>
            <process expanded="true">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Union" to_port="example set 1"/>
          <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Union" to_port="example set 2"/>
          <connect from_op="Union" from_port="union" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


  • johan_CG
    johan_CG New Altair Community Member
    Hi Venkatesh

    Thank you for your reply.
    To begin my work, I have a table that looks like this:

    Domain   | Sub-domain   | Item   | Keywords
    Domain 1 | Sub-domain 1 | Item 1 | KW_1, KW_2,
    Domain 1 | Sub-domain 1 | Item 2 | KW_3, KW_4, KW_5, KW_6...

    Using RapidMiner, I transformed this table into the following:

    ID   | Item   | Keyword
    id_1 | item_1 | KW_1
    id_1 | item_1 | KW_2
    id_1 | item_2 | KW_3

    I have to filter all documents stored in a folder using the keywords, which is why I needed an operator that works like the inverse of the "Filter Stopwords (Dictionary)" operator.
    However, the "Filter Stopwords (Dictionary)" operator uses a txt file as its dictionary.

    Finally, to solve my problem, I created a new operator, "Filter Startword (Dictionary)", by removing the '!' in the class "StopwordOperator" at line 74, so that it keeps the dictionary words instead of removing them.
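
    The effect of removing the negation can be illustrated outside RapidMiner. The Python sketch below is an analogy only, not the operator's actual source: a stopword filter drops every token found in the dictionary, and the inverted version keeps only those tokens. The function names and sample data are hypothetical, and the comparison is an exact, case-sensitive match, which may differ from how the operator compares tokens.

```python
from collections import Counter

def filter_stopwords(tokens, dictionary):
    """Stopword behaviour: keep tokens that are NOT in the dictionary."""
    return [t for t in tokens if t not in dictionary]

def filter_startwords(tokens, dictionary):
    """Inverted behaviour (negation removed): keep ONLY tokens in the dictionary."""
    return [t for t in tokens if t in dictionary]

# Sample data, invented for illustration.
keywords = {"KW_1", "KW_2", "KW_3"}
tokens = ["KW_1", "some", "other", "KW_3", "text", "KW_1"]

kept = filter_startwords(tokens, keywords)
counts = Counter(kept)  # occurrences of each keyword in this document

print(kept)
print(counts)
```

    Counting the kept tokens per document then gives the keyword counts the process is after.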
    For the list of words (a plain list, not a WordList), I used the following operators:
    • "Set Role" to remove the special role from the ID attribute
    • "Select Attributes" with the "single" option to keep only the keywords
    • "Write CSV" with a space as the column separator, and I connected the "file" output
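
    The three steps above amount to reducing the table to its keyword column and writing it out as plain text. A minimal Python sketch of the same idea follows, assuming the target is one keyword per line. The rows mirror the sample table, the function name `write_dictionary` is hypothetical, and skipping duplicates is an addition of this sketch, not something the operators above are stated to do.

```python
import io

# Rows mirroring the transformed table (ID, Item, Keyword).
rows = [
    {"ID": "id_1", "Item": "item_1", "Keyword": "KW_1"},
    {"ID": "id_1", "Item": "item_1", "Keyword": "KW_2"},
    {"ID": "id_1", "Item": "item_2", "Keyword": "KW_3"},
    {"ID": "id_1", "Item": "item_2", "Keyword": "KW_1"},  # duplicate, skipped below
]

def write_dictionary(rows, column, out):
    """Write the values of one column to `out`, one per line, skipping duplicates."""
    seen = set()
    for row in rows:
        word = row[column].strip()
        if word and word not in seen:
            seen.add(word)
            out.write(word + "\n")

# An in-memory buffer stands in for the txt file used as the dictionary.
buffer = io.StringIO()
write_dictionary(rows, "Keyword", buffer)
dictionary_text = buffer.getvalue()
print(dictionary_text)
```

    To produce a real file, the buffer can be replaced with a file opened for writing, for example `open("keywords.txt", "w", encoding="utf-8")`, where the file name is only a placeholder.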
    I hope my explanation is not too confusing.

    Greetings
    Johan
