Remove Unwanted Words from List

User: "ronmac"
New Altair Community Member
Updated by Jocelyn
I would like to remove unwanted words from a project I am working on. I figured out that I can use the Remove Document Parts operator to get rid of "http" in my results, but I have more words I would like to filter out, for example "chart", "twitter", "trade", and "message". Can someone explain how I can expand the list of words to filter out? I would like the flexibility to change the list as needed based on the search results. Also, is the Stem operator required for what I am doing here?

Thanks
Ron McEwan
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
    <process expanded="true" height="251" width="413">
      <operator activated="true" class="web:read_rss" compatibility="5.0.4" expanded="true" height="60" name="Read RSS Feed" width="90" x="45" y="30">
        <parameter key="url" value="http://stocktwits.com/streams/all?rss=true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.7" expanded="true" height="76" name="Process Documents from Data" width="90" x="45" y="165">
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="408" width="570">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.0.7" expanded="true" height="60" name="Remove Document Parts" width="90" x="45" y="30">
            <parameter key="deletion_regex" value="http"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="45" y="165"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.0.7" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="165"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.0.7" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="165">
            <parameter key="max_chars" value="10"/>
          </operator>
          <operator activated="true" class="text:stem_snowball" compatibility="5.0.7" expanded="true" height="60" name="Stem (Snowball)" width="90" x="45" y="300"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="300"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.0.7" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="313" y="300">
            <parameter key="max_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Remove Document Parts" to_port="document"/>
          <connect from_op="Remove Document Parts" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="mututal_information_matrix" compatibility="5.0.11" expanded="true" height="76" name="Mututal Information Matrix" width="90" x="277" y="219"/>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.0.7" expanded="true" height="76" name="WordList to Data" width="90" x="246" y="30"/>
      <connect from_op="Read RSS Feed" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Mututal Information Matrix" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="Mututal Information Matrix" from_port="example set" to_port="result 3"/>
      <connect from_op="Mututal Information Matrix" from_port="matrix" to_port="result 4"/>
      <connect from_op="WordList to Data" from_port="word list" to_port="result 1"/>
      <connect from_op="WordList to Data" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

    User: "Rene"
    New Altair Community Member
    I never tried this and I'm no RM connoisseur, but I think you could, for example, use a regular expression to get rid of a short list of words: "http|chart|twitter". Or create your own list of stop words and refer to it with a stopword-filter operator when you are working on tokens. "Stemming" refers to reducing words to their roots: 'solicited' and 'solicitation', for example, may both result in 'solicit' with a stemming algorithm, while 'unsolicited' would typically become 'unsolicit', since most stemmers strip only suffixes.
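The regular-expression suggestion can be sketched against the process posted in the question. The fragment below is a minimal sketch, not a tested process: it reuses the Remove Document Parts operator and its `deletion_regex` parameter exactly as they appear in the original XML and only widens the pattern with alternation. The word list is taken from the question.

```xml
<!-- Sketch: same operator as in the posted process, with the pattern
     widened by alternation. Edit the word list between the | separators. -->
<operator activated="true" class="text:remove_document_parts" compatibility="5.0.7" expanded="true" height="60" name="Remove Document Parts" width="90" x="45" y="30">
  <parameter key="deletion_regex" value="http|chart|twitter|trade|message"/>
</operator>
```

Two caveats, assuming the operator uses standard Java regular expressions: the pattern deletes matching substrings, so "charts" would be reduced to "s" unless word boundaries (`\b`) are added, and since this step runs before Transform Cases, matching is case-sensitive, so "Twitter" would be left untouched.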
    User: "B_"
    New Altair Community Member
    Under Text Processing, the Filter Stopwords (Dictionary) operator reads a file of "personal stopwords."
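For illustration, a dictionary-based filter could be added inside Process Documents from Data roughly as follows. This is a sketch under assumptions: the operator key `text:filter_stopwords_dictionary` and the `file` parameter name are what I would expect from the Text Processing extension and should be checked against your installed version, and the file path is purely hypothetical.

```xml
<!-- Assumed operator key and parameter name; the file path is hypothetical.
     Wire the "document" ports as for the other token filters in the process. -->
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.0.7" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="179" y="300">
  <parameter key="file" value="/path/to/my_stopwords.txt"/>
</operator>
```

The dictionary is expected to be a plain text file with one stopword per line (for example `chart`, `twitter`, `trade`, `message`, each on its own line), so changing the list should only require editing that file. Placing the operator after Transform Cases and before any stemming step is probably the safer choice, since stemmed tokens may no longer match the dictionary entries.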
    User: "ronmac"
    New Altair Community Member
    OP
    Thanks for the help. The personal dictionary was exactly what I needed. Now I can modify this list as necessary. I also removed the Stem operator; I had misunderstood its application.
    User: "gunjanamit"
    New Altair Community Member
    How can we modify the dictionary?
    User: "rajbanokhan"
    New Altair Community Member

    Hi,

    There are so many words from 675 pages. How can I reduce the word list? I want only the 30 to 40 most important words.