Remove URL from document

fokko
fokko New Altair Community Member
edited November 5 in Community Q&A
Hello,
I have a problem with my text preprocessing. Maybe someone can help me :)

My text looks like this:

T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.
Commodities
Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.
Europe
European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.
[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:



I want to remove the URLs from the text. How can I do this? I think Filter Tokens does not work. Is Remove Document Parts the solution?

I think the solution should be a rule like this: if a word starts with http or www, delete that word from the text (but only the URL, not the rest of the text).



Kind regards

Answers

  • homburg
    homburg New Altair Community Member
    Hi fokko,

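    Depending on your setup, you might use "Replace" (for example sets) or "Replace Tokens" (for tokenized documents) with a regex like this: \[\d*\][^\[\]]* to match all numbered URL links in your text input. The pattern matches a bracketed number such as [1] plus everything that follows it up to the next bracket.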

    Cheers,
    Helge
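To make the suggested pattern easier to read, here is a minimal sketch in Python of what it matches. RapidMiner evaluates regular expressions with Java's engine, but this pattern only uses syntax that behaves the same way in both. The shortened sample string and the variable names are assumptions made for illustration.

```python
import re

# Pattern from the post: a bracketed number such as "[1]" followed by
# everything up to (but not including) the next "[" or "]".
REFERENCE_ENTRY = re.compile(r"\[\d*\][^\[\]]*")

# Shortened, hypothetical sample modelled on the text in the question.
sample = (
    "Germany's DAX fell lower by 1.20%.\n"
    "[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com "
    "[2]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [3]:"
)

# Replacing every match with the empty string deletes the whole link list.
cleaned = REFERENCE_ENTRY.sub("", sample)
print(repr(cleaned))
```

Note that the pattern consumes everything after a "[n]" marker until the next bracket, line breaks included, so it fits a link list at the end of a document rather than references scattered through the body text.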
  • fokko
    Thanks for your response, but I still can't solve my problem. I don't understand the regex. If I want to delete the words in the text that begin with http, what is the regex, and how do I configure the operator?

    To solve the problem, my setup only consists of Process Documents from Files, after which I tried Replace for example sets.

    I don't tokenize in my setup. (If I tokenized a URL like www.helpme.com, I would get www, helpme and com, so if I searched for www, I could not delete the complete URL.)

    Thank you for your comments.
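The rule described in the question (drop any word that starts with http or www) can be written as a single regex and applied to the raw, untokenized text, which avoids the problem of a URL being split into pieces. The following is a minimal sketch in Python. The pattern `URL_WORD`, the helper `strip_urls` and the sample sentence are assumptions for illustration and do not come from the thread. The expression only uses syntax that Java's regex engine shares, so it may also serve as a deletion regex in RapidMiner, but that has not been verified here.

```python
import re

# Splitting on non-letters (roughly what tokenizing does) breaks the URL apart,
# which is why searching the tokens for "www" cannot remove the whole URL.
tokens = re.findall(r"[A-Za-z]+", "www.helpme.com")
print(tokens)

# Hypothetical pattern: a "word" that starts with http://, https:// or www.
# and runs up to the next whitespace character.
URL_WORD = re.compile(r"\b(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like words, then collapse the spaces left behind."""
    without_urls = URL_WORD.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_urls).strip()


sample = "Read more at www.helpme.com or http://www.proactiveinvestors.com/companies today"
print(strip_urls(sample))
```

Because `\S+` runs to the next whitespace, punctuation that directly follows a URL (for example a closing period) is removed along with it.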
  • homburg
    Hi!

    You don't need to. Please have a look:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Load Text" width="90" x="112" y="30">
            <parameter key="text" value="T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.&#10;Commodities&#10;Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.&#10;Europe&#10;European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.&#10;[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:"/>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens" width="90" x="447" y="30">
            <list key="replace_dictionary">
              <parameter key="\[\d*\][^\[\]]*" value="!!REPLACED!! "/>
            </list>
          </operator>
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Load Text (2)" width="90" x="112" y="120">
            <parameter key="text" value="T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.&#10;Commodities&#10;Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.&#10;Europe&#10;European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.&#10;[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:"/>
          </operator>
          <operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts" width="90" x="447" y="120">
            <parameter key="deletion_regex" value="\[\d*\][^\[\]]*"/>
          </operator>
          <connect from_op="Load Text" from_port="output" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="result 1"/>
          <connect from_op="Load Text (2)" from_port="output" to_op="Remove Document Parts" to_port="document"/>
          <connect from_op="Remove Document Parts" from_port="document" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Cheers,
    Helge
  • fokko
    Thanks, I now have an idea of how the process works.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="test" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Beispiele"/>
            </list>
            <parameter key="use_file_extension_as_type" value="false"/>
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <process expanded="true">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.0.003" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
            <parameter key="attribute_name" value="text"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.0.003" expanded="true" height="94" name="Multiply" width="90" x="313" y="30"/>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="255">
            <list key="text_directories">
              <parameter key="positive" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Positive"/>
            </list>
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Stem (2)" to_port="document"/>
              <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="255">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.0.003" expanded="true" height="76" name="Generate Aggregation" width="90" x="313" y="255">
            <parameter key="attribute_name" value="positive"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.0.003" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="255">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="metadata_path|text|positive|label|metadata_date|metadata_file"/>
          </operator>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (3)" width="90" x="45" y="345">
            <list key="text_directories">
              <parameter key="negative" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Negative"/>
            </list>
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="180" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (3)" width="90" x="416" y="30"/>
              <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
              <connect from_op="Tokenize (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
              <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="179" y="345">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (4)" width="90" x="180" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (4)" width="90" x="484" y="30"/>
              <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
              <connect from_op="Tokenize (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
              <connect from_op="Stem (4)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.0.003" expanded="true" height="76" name="Generate Aggregation (2)" width="90" x="313" y="345">
            <parameter key="attribute_name" value="negative"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.0.003" expanded="true" height="76" name="Select Attributes (2)" width="90" x="447" y="345">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="metadata_path|text|negative|label|metadata_date|metadata_file"/>
          </operator>
          <operator activated="true" class="join" compatibility="6.0.003" expanded="true" height="76" name="Join" width="90" x="581" y="300">
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="metadata_path" value="metadata_path"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="6.0.003" expanded="true" height="76" name="Generate Attributes" width="90" x="715" y="300">
            <list key="function_descriptions">
              <parameter key="Sentiment" value="(positive-negative)/(positive+negative)"/>
            </list>
          </operator>
          <operator activated="true" class="write_excel" compatibility="6.0.003" expanded="true" height="76" name="Write Excel" width="90" x="715" y="435">
            <parameter key="excel_file" value="C:\Users\chris_000\Desktop\Output.xls"/>
            <parameter key="file_format" value="xlsx"/>
            <parameter key="sheet_name" value="RapidMiner Test"/>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Aggregation" to_port="example set input"/>
          <connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Process Documents from Files (3)" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
          <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Generate Aggregation (2)" to_port="example set input"/>
          <connect from_op="Generate Aggregation (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>


    But I still have trouble fitting your suggestion into my process, because I have other operators. The Load Text operator doesn't work for my setup because I have lots of texts.

    The purpose of my process is to find "positive" or "negative" words in text documents (.txt).

    Another question: how can I filter out all documents that contain the term "via twitter"? I tried Filter Examples, but my setup only works for a single word, not for two words or a phrase.

    Best
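The Generate Attributes step in the process above computes Sentiment as (positive-negative)/(positive+negative). Below is a minimal sketch of that formula in Python. The function name and the choice to return 0.0 for a document with no dictionary hits are assumptions; the process itself has no such guard, so a document without any positive or negative words would divide zero by zero.

```python
def sentiment(positive: int, negative: int) -> float:
    """Score in [-1, 1] from the counts of positive and negative words."""
    total = positive + negative
    if total == 0:
        # Assumption: treat a document without any dictionary hits as neutral.
        return 0.0
    return (positive - negative) / total


print(sentiment(3, 1))  # three positive hits, one negative hit
print(sentiment(0, 0))  # no hits at all
```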
  • homburg
    Hi fokko,

    Here is your process with an alternative input chain that shows how to attach the filter techniques. It is useful not to convert your data to the example set format too early, as this limits your options to filter and replace tokens or documents.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
           <list key="text_directories">
             <parameter key="test" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Beispiele"/>
           </list>
           <parameter key="use_file_extension_as_type" value="false"/>
           <parameter key="create_word_vector" value="false"/>
           <parameter key="keep_text" value="true"/>
           <process expanded="true">
             <connect from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="set_role" compatibility="6.0.008" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
           <parameter key="attribute_name" value="text"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="6.0.008" expanded="true" height="94" name="Multiply" width="90" x="313" y="30"/>
         <operator activated="true" class="loop_files" compatibility="6.0.008" expanded="true" height="76" name="Load &amp; Delete URLs" width="90" x="514" y="75">
           <process expanded="true">
             <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="112" y="30"/>
             <operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Delete URLs" width="90" x="313" y="30">
               <parameter key="deletion_regex" value="\[\d*\][^\[\]]*"/>
             </operator>
             <connect from_port="file object" to_op="Read Document" to_port="file"/>
             <connect from_op="Read Document" from_port="output" to_op="Delete URLs" to_port="document"/>
             <connect from_op="Delete URLs" from_port="document" to_port="out 1"/>
             <portSpacing port="source_file object" spacing="0"/>
             <portSpacing port="source_in 1" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="255">
           <list key="text_directories">
             <parameter key="positive" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Positive"/>
           </list>
           <parameter key="vector_creation" value="Binary Term Occurrences"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="313" y="30"/>
             <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="447" y="30"/>
             <connect from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Stem (2)" to_port="document"/>
             <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="255">
           <parameter key="vector_creation" value="Term Occurrences"/>
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
             <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
             <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
             <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
             <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="generate_aggregation" compatibility="6.0.008" expanded="true" height="76" name="Generate Aggregation" width="90" x="313" y="255">
           <parameter key="attribute_name" value="positive"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="255">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="metadata_path|text|positive|label|metadata_date|metadata_file"/>
         </operator>
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (3)" width="90" x="45" y="345">
           <list key="text_directories">
             <parameter key="negative" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Negative"/>
           </list>
           <parameter key="vector_creation" value="Binary Term Occurrences"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="180" y="30"/>
             <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (3)" width="90" x="416" y="30"/>
             <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
             <connect from_op="Tokenize (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
             <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="179" y="345">
           <parameter key="vector_creation" value="Term Occurrences"/>
           <parameter key="keep_text" value="true"/>
           <list key="specify_weights"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (4)" width="90" x="180" y="30"/>
             <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (4)" width="90" x="484" y="30"/>
             <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
             <connect from_op="Tokenize (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
             <connect from_op="Stem (4)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="generate_aggregation" compatibility="6.0.008" expanded="true" height="76" name="Generate Aggregation (2)" width="90" x="313" y="345">
           <parameter key="attribute_name" value="negative"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="6.0.008" expanded="true" height="76" name="Select Attributes (2)" width="90" x="447" y="345">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="metadata_path|text|negative|label|metadata_date|metadata_file"/>
         </operator>
         <operator activated="true" class="join" compatibility="6.0.008" expanded="true" height="76" name="Join" width="90" x="581" y="300">
           <parameter key="use_id_attribute_as_key" value="false"/>
           <list key="key_attributes">
             <parameter key="metadata_path" value="metadata_path"/>
           </list>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.0.008" expanded="true" height="76" name="Generate Attributes" width="90" x="715" y="300">
           <list key="function_descriptions">
             <parameter key="Sentiment" value="(positive-negative)/(positive+negative)"/>
           </list>
         </operator>
         <operator activated="true" class="write_excel" compatibility="6.0.008" expanded="true" height="76" name="Write Excel" width="90" x="715" y="435">
           <parameter key="excel_file" value="C:\Users\chris_000\Desktop\Output.xls"/>
           <parameter key="file_format" value="xlsx"/>
           <parameter key="sheet_name" value="RapidMiner Test"/>
         </operator>
         <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Twitter" width="90" x="648" y="75">
           <parameter key="string" value="via twitter"/>
         </operator>
         <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="782" y="75">
           <process expanded="true">
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Process Documents from Files" from_port="example set" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
         <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data (2)" to_port="example set"/>
         <connect from_op="Load &amp; Delete URLs" from_port="out 1" to_op="Filter Twitter" to_port="documents 1"/>
         <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
         <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Aggregation" to_port="example set input"/>
         <connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
         <connect from_op="Process Documents from Files (3)" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
         <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Generate Aggregation (2)" to_port="example set input"/>
         <connect from_op="Generate Aggregation (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
         <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
         <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
         <connect from_op="Filter Twitter" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
       </process>
     </operator>
    </process>
    Cheers,
    Helge
  • fokko
    fokko New Altair Community Member
    Thanks a lot. :) Great support, Helge.

    I tried to add other text filters to this setup. I want to filter texts with: GROK-126315. How can I implement this? I tried a few approaches (Multiply, repeating the same process) but it hasn't worked so far. [Update: I found a basic way to solve this.]

    Besides that, my implementation of the new input chain doesn't work.
  • homburg
    homburg New Altair Community Member
    Hi fokko,

    can you send me a process or provide more information regarding your current issues? If you want to filter more tokens, just add another filter to your process (like the one for twitter). You may also use regular expressions in your filters to reduce the number of operators.
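
To illustrate the regular-expression idea outside of RapidMiner, here is a minimal Python sketch of merging several filter strings into one pattern. It is not RapidMiner code: the second filter string, the sample documents, and the helper name `keep_document` are hypothetical, and it assumes that documents containing any of the strings should be dropped (as with "invert condition" in the thread's filter).

```python
import re

# "via twitter" is the filter string from the thread;
# "via facebook" is a hypothetical second string for illustration.
FILTER_STRINGS = ["via twitter", "via facebook"]

# re.escape keeps each string literal; "|" joins them as alternatives,
# so one pattern does the work of several separate filters.
PATTERN = re.compile("|".join(re.escape(s) for s in FILTER_STRINGS))


def keep_document(text: str) -> bool:
    """Return True if the text contains none of the filter strings."""
    return PATTERN.search(text) is None


# Hypothetical sample documents.
docs = [
    "Apple shares rose today",
    "Great news via twitter",
    "Posted via facebook this morning",
]
kept = [d for d in docs if keep_document(d)]
```

The same alternation (`via twitter|via facebook`) is the kind of expression that could replace two chained filter operators with one, provided the filter is set to match by regular expression.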

    Cheers,
    Helge
  • fokko
    fokko New Altair Community Member
    The problem is implementing the input chain for the text data. I built the setup, but it has an error and I don't really know how to fix it.



    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="255">
            <list key="text_directories">
              <parameter key="positive" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Positive"/>
            </list>
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Stem (2)" to_port="document"/>
              <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files (3)" width="90" x="45" y="345">
            <list key="text_directories">
              <parameter key="negative" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Dictionary\General Inquirer\Negative"/>
            </list>
            <parameter key="vector_creation" value="Binary Term Occurrences"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="180" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (3)" width="90" x="416" y="30"/>
              <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
              <connect from_op="Tokenize (3)" from_port="document" to_op="Stem (3)" to_port="document"/>
              <connect from_op="Stem (3)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_files" compatibility="6.0.003" expanded="true" height="76" name="Loop Files" width="90" x="45" y="30">
            <parameter key="directory" value="C:\Users\chris_000\Desktop\Master Doks\Arbeitsstand\Textdaten\Pre\Apple\Split"/>
            <process expanded="true">
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="45" y="75">
            <parameter key="string" value="via twitter"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (2)" width="90" x="45" y="120">
            <parameter key="string" value="via twitter"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="246" y="30">
            <parameter key="keep_text" value="true"/>
            <process expanded="true">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.0.003" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
            <parameter key="attribute_name" value="text"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.0.003" expanded="true" height="94" name="Multiply" width="90" x="581" y="30"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="255">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="179" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.0.003" expanded="true" height="76" name="Generate Aggregation" width="90" x="313" y="255">
            <parameter key="attribute_name" value="positive"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.0.003" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="255">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="metadata_path|text|positive|label|metadata_date|metadata_file"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="179" y="345">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (4)" width="90" x="180" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (4)" width="90" x="484" y="30"/>
              <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
              <connect from_op="Tokenize (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
              <connect from_op="Stem (4)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="6.0.003" expanded="true" height="76" name="Generate Aggregation (2)" width="90" x="313" y="345">
            <parameter key="attribute_name" value="negative"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.0.003" expanded="true" height="76" name="Select Attributes (2)" width="90" x="447" y="345">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="metadata_path|text|negative|label|metadata_date|metadata_file"/>
          </operator>
          <operator activated="true" class="join" compatibility="6.0.003" expanded="true" height="76" name="Join" width="90" x="581" y="300">
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="metadata_path" value="metadata_path"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="6.0.003" expanded="true" height="76" name="Generate Attributes" width="90" x="715" y="300">
            <list key="function_descriptions">
              <parameter key="Sentiment" value="(positive-negative)/(positive+negative)"/>
            </list>
          </operator>
          <operator activated="true" class="write_excel" compatibility="6.0.003" expanded="true" height="76" name="Write Excel" width="90" x="715" y="435">
            <parameter key="excel_file" value="C:\Users\chris_000\Desktop\Output.xls"/>
            <parameter key="file_format" value="xlsx"/>
            <parameter key="sheet_name" value="RapidMiner Test"/>
          </operator>
          <connect from_op="Process Documents from Files (2)" from_port="word list" to_op="Process Documents from Data" to_port="word list"/>
          <connect from_op="Process Documents from Files (3)" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
          <connect from_op="Loop Files" from_port="out 1" to_op="Filter Documents (by Content)" to_port="documents 1"/>
          <connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Filter Documents (2)" to_port="documents 1"/>
          <connect from_op="Filter Documents (2)" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Aggregation" to_port="example set input"/>
          <connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Generate Aggregation (2)" to_port="example set input"/>
          <connect from_op="Generate Aggregation (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>

  • fokko
    fokko New Altair Community Member
    It now works.

    Helge, thanks a lot for all your support.
  • homburg
    homburg New Altair Community Member
    Great to hear that your process now works as expected.

    Happy Mining!