whitespace regular expression filter tokens by content

student24
student24 New Altair Community Member
edited November 5 in Community Q&A
Hello everybody,

I want to search for words in documents. I use the operator Filter Tokens (by Content) with a regular expression. To search for more than one word I use word1|word2|...|wordn. My question is: how can I search for an expression that contains whitespace, for example "Research and Development|Word2|Word3"? Is there a wildcard for whitespace?

Thanks for your help

Answers

  • RalfKlinkenberg
    RalfKlinkenberg New Altair Community Member
    You can use
    • \s as a placeholder for a single whitespace character,
    • \s+ for one or more whitespace characters, and
    • \s* for zero or more whitespace characters.
    • \t is a placeholder for a tab character.
    RapidMiner regular expressions use the Java syntax for regular expressions. If you search for "Java regular expressions" with Google or another search engine, you will find a lot of documentation.

    Best wishes,
    Ralf
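
To illustrate the placeholders listed above, here is a minimal Java sketch. It only uses the standard java.util.regex package and does not involve RapidMiner; the class name WhitespacePatterns and the helper fullMatch are made up for this example. Note that backslashes are doubled inside Java string literals, whereas in a RapidMiner parameter field the expression is written with single backslashes.

```java
import java.util.regex.Pattern;

// Illustration only: class and method names are hypothetical.
class WhitespacePatterns {

    // True if the WHOLE input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \s stands for exactly one whitespace character.
        System.out.println(fullMatch("research\\sand\\sdevelopment",
                                     "research and development"));   // true
        System.out.println(fullMatch("research\\sand\\sdevelopment",
                                     "research  and development"));  // false (two spaces)

        // \s+ stands for one or more whitespace characters (spaces, tabs, ...).
        System.out.println(fullMatch("research\\s+and\\s+development",
                                     "research  and\tdevelopment")); // true

        // \s* also accepts no whitespace at all.
        System.out.println(fullMatch("research\\s*and\\s*development",
                                     "researchanddevelopment"));     // true

        // \t stands for a tab character only.
        System.out.println(fullMatch("a\\tb", "a\tb"));              // true
    }
}
```

Alternatives work the same way as with single words, for example research\s+and\s+development|word2|word3.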
  • student24
    student24 New Altair Community Member
    Thank you very much for your reply.

    I have tried these before, but they don't work. There are no results in the word list although the expression is in the document. I don't know what I'm doing wrong. Do you know whether it works when I'm examining PDF files?

    Thanks
  • RalfKlinkenberg
    RalfKlinkenberg New Altair Community Member
    If you post the XML code of your RapidMiner process here, there is a chance that someone in the forum may be able to help.

    Without being able to see the RapidMiner process, we can only guess where the problem in your RapidMiner process might be.  ;)
  • student24
    student24 New Altair Community Member
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
           <list key="text_directories">
             <parameter key="test" value="C:"/>
           </list>
           <parameter key="keep_text" value="true"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
             <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="380" y="30"/>
             <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="514" y="30">
               <parameter key="condition" value="matches"/>
               <parameter key="regular_expression" value="research\sand\sdevelopment"/>
             </operator>
             <connect from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
             <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
         <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
         <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="source_input 2" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>

  • fras
    fras New Altair Community Member
    For some reason your XML is not valid, but the important line is this:
      <parameter key="regular_expression" value="research\sand\sdevelopment"/>
    If you search for "Research...", this regex will fail because upper/lower case
    matters unless you ignore it by applying the regex flag "i" (written (?i) in Java syntax).
    It will also fail if there is more than one whitespace character between the words.
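
The two points above (case sensitivity and repeated whitespace) can be sketched in plain Java as follows. This is an illustration with the standard java.util.regex package, not RapidMiner code; the class name CaseInsensitiveMatch and the method matchesPhrase are hypothetical.

```java
import java.util.regex.Pattern;

// Illustration only: class and method names are hypothetical.
class CaseInsensitiveMatch {

    // (?i) turns on case-insensitive matching for the rest of the pattern;
    // \s+ accepts one or more whitespace characters between the words.
    static final Pattern PHRASE =
            Pattern.compile("(?i)research\\s+and\\s+development");

    // True if the whole text is the phrase, in any letter case.
    static boolean matchesPhrase(String text) {
        return PHRASE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase("Research and Development"));    // true
        System.out.println(matchesPhrase("RESEARCH   and\tdevelopment")); // true
        System.out.println(matchesPhrase("research & development"));      // false
    }
}
```

In a RapidMiner parameter field the same expression would be written with single backslashes: (?i)research\s+and\s+development.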
  • student24
    student24 New Altair Community Member
    I thought I was ignoring upper/lower case by using the operator "Transform Cases" with the option "lower case". For more than one whitespace character I could use \s+, but that doesn't work either.
    Why is my XML not valid? :)
  • MariusHelf
    MariusHelf New Altair Community Member
    Corrected XML above.
  • student24
    student24 New Altair Community Member
    OK, thank you. I copied it, but the problem with the whitespace isn't solved. I don't know what I'm doing wrong.
    Is there maybe another operator or another way I can search for an expression in documents?
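
One possible explanation, offered as a guess and not confirmed anywhere in this thread: in the posted process the filter runs after Tokenize. Assuming Tokenize splits the text at non-letter characters and the condition "matches" tests each token in full, every token is a single word and can never contain whitespace, so a multi-word expression finds nothing. The Java sketch below only emulates that assumed behaviour; the class TokenFilterSketch and its methods are hypothetical stand-ins, not RapidMiner APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: emulates the ASSUMED operator behaviour in plain Java.
class TokenFilterSketch {

    static final Pattern PHRASE =
            Pattern.compile("research\\s+and\\s+development");

    // Stand-in for Tokenize + Transform Cases (assumption):
    // lower-case the text and split it at runs of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for a "matches" filter (assumption):
    // keep only tokens that match the expression in full.
    static List<String> filterTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (PHRASE.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Searching the untokenized text finds the phrase.
    static boolean foundInText(String text) {
        return PHRASE.matcher(text.toLowerCase()).find();
    }

    public static void main(String[] args) {
        String text = "Our Research and Development team";
        System.out.println(tokenize(text));               // [our, research, and, development, team]
        System.out.println(filterTokens(tokenize(text))); // []
        System.out.println(foundInText(text));            // true
    }
}
```

If this assumption holds, the phrase would have to be searched for before the text is split into single-word tokens, or the tokens would have to be recombined into multi-word terms before filtering.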