Remove Unwanted Words from List

ronmac
ronmac New Altair Community Member
edited November 2024 in Community Q&A
I would like to remove unwanted words from the project I am working on. I figured out that I can use the Remove Document Parts operator to get rid of "http" in my results, but I have more words I would like to filter out, for example "chart", "twitter", "trade", and "message". Can someone explain how I can expand the list of words to filter out? I would like the flexibility to change the list as needed based on the search results. Also, is the Stem operator required for what I am doing here?

Thanks
Ron McEwan
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
    <process expanded="true" height="251" width="413">
      <operator activated="true" class="web:read_rss" compatibility="5.0.4" expanded="true" height="60" name="Read RSS Feed" width="90" x="45" y="30">
        <parameter key="url" value="http://stocktwits.com/streams/all?rss=true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.0.7" expanded="true" height="76" name="Process Documents from Data" width="90" x="45" y="165">
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="408" width="570">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.0.7" expanded="true" height="60" name="Remove Document Parts" width="90" x="45" y="30">
            <parameter key="deletion_regex" value="http"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="45" y="165"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.0.7" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="165"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.0.7" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="313" y="165">
            <parameter key="max_chars" value="10"/>
          </operator>
          <operator activated="true" class="text:stem_snowball" compatibility="5.0.7" expanded="true" height="60" name="Stem (Snowball)" width="90" x="45" y="300"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="300"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.0.7" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="313" y="300">
            <parameter key="max_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Remove Document Parts" to_port="document"/>
          <connect from_op="Remove Document Parts" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="mututal_information_matrix" compatibility="5.0.11" expanded="true" height="76" name="Mututal Information Matrix" width="90" x="277" y="219"/>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.0.7" expanded="true" height="76" name="WordList to Data" width="90" x="246" y="30"/>
      <connect from_op="Read RSS Feed" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Mututal Information Matrix" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="Mututal Information Matrix" from_port="example set" to_port="result 3"/>
      <connect from_op="Mututal Information Matrix" from_port="matrix" to_port="result 4"/>
      <connect from_op="WordList to Data" from_port="word list" to_port="result 1"/>
      <connect from_op="WordList to Data" from_port="example set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • Rene
    Rene New Altair Community Member
    I have never tried this and I'm no RM connoisseur, but I think you could use a regular expression to get rid of a short list of words, e.g. "http|chart|twitter". Alternatively, create your own list of stop words and refer to it with a stopword filter operator once you are working on tokens. "Stemming" refers to reducing words to their roots: 'solicited', 'solicitation', 'unsolicited' etc. may all end up as 'solicit' after a stemming algorithm is applied.
  • B_
    B_ New Altair Community Member
    In the Text Processing extension, the Filter Stopwords (Dictionary) operator reads "personal stopwords" from a file.
  • ronmac
    ronmac New Altair Community Member
    Thanks for the help. The personal dictionary was exactly what I needed. Now I can modify this list as necessary. I also removed the Stem operator. I had misunderstood its application.
  • gunjanamit
    gunjanamit New Altair Community Member
    How can we modify the dictionary?
  • rajbanokhan
    rajbanokhan New Altair Community Member

    Hi, there are so many words from 675 pages. How can I reduce the word list? I want only the 30 to 40 most important words.
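
For reference, here is a rough sketch of how the two suggestions from the earlier answers (a longer regex, or a personal stopword file) might look inside the "Process Documents from Data" subprocess. The first operator only extends the regex already used in the original process. The second is a guess at the dictionary filter: the class key "text:filter_stopwords_dictionary", the "file" parameter key, the layout coordinates, and the file path are assumptions that do not come from this thread, so check them against your installed Text Processing extension.

```xml
<!-- Sketch only: fragment for the inner process of "Process Documents from Data". -->

<!-- Option 1: widen the regex from the original process to cover more words. -->
<operator activated="true" class="text:remove_document_parts" compatibility="5.0.7" expanded="true" height="60" name="Remove Document Parts" width="90" x="45" y="30">
  <parameter key="deletion_regex" value="http|chart|twitter|trade|message"/>
</operator>

<!-- Option 2: personal stopword file, placed after Tokenize and Transform Cases. -->
<!-- Assumed: class key, "file" parameter key, coordinates, and the hypothetical path. -->
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.0.7" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="179" y="435">
  <parameter key="file" value="/path/to/my_stopwords.txt"/>
</operator>
```

The stopword file is assumed to be plain text with one word per line (chart, twitter, trade, message), so changing the list would just mean editing that file and rerunning the process. Because the regex version deletes matching text before tokenizing, it would probably also cut "charts" down to "s", whereas the dictionary filter works on whole tokens. If Stem (Snowball) stays ahead of the filter, the entries in the file would need to be the stemmed forms.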
