whitespace regular expression filter tokens by content
student24
Hello everybody,
I want to search for words in documents. I use the operator Filter Tokens (by Content) with a regular expression. To search for more than one word I use word1|word2|...|wordn. My question: how can I search for an expression that contains whitespace, for example "Research and Development|Word2|Word3"? Is there a wildcard for whitespace?
Thanks for your help
RalfKlinkenberg
You can use
\s as a placeholder for a whitespace character,
\s+ for one or more whitespace characters, and
\s* for zero, one, or more whitespace characters.
\t is a placeholder for tab characters.
RapidMiner regular expressions use the Java syntax for regular expressions. If you search for "Java regular expressions" with Google or another search engine, you will find a lot of documentation.
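As a quick illustration of the placeholders listed above, here is a minimal sketch written in Python rather than in RapidMiner. Python's re module is used only for convenience: for the plain ASCII spaces and tabs shown here, \s, \s+, \s* and \t behave the same way as in the Java syntax. The sample strings and the helper full() are made up for this example.

```python
import re

# Hypothetical sample strings, not taken from any real document.
one_space = "research and development"
two_spaces = "research  and development"
no_space = "researchand development"
with_tab = "research\tand development"

# \s matches exactly one whitespace character (space, tab, line break, ...).
single = re.compile(r"research\sand\sdevelopment")
# \s+ matches one or more whitespace characters.
one_or_more = re.compile(r"research\s+and\s+development")
# \s* matches zero, one, or more whitespace characters.
zero_or_more = re.compile(r"research\s*and\s*development")
# \t matches a tab character only, not a space.
tab_only = re.compile(r"research\tand\sdevelopment")


def full(pattern, text):
    """Return True if the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None
```

With these definitions, full(single, two_spaces) is False because \s consumes only one of the two spaces, while full(one_or_more, two_spaces) is True.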
Best wishes,
Ralf
student24
Thank you very much for your reply.
I tried these before, but they don't work. There are no results in the word list although the expression is in the document. I don't know what I'm doing wrong. Do you know whether it works when I'm examining PDF files?
Thanks
RalfKlinkenberg
If you post the XML code of your RapidMiner process here, there is a chance that someone in the forum may be able to help.
Without seeing the RapidMiner process, we can only guess where the problem might be.
student24
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
<list key="text_directories">
<parameter key="test" value="C:"/>
</list>
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="380" y="30"/>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="514" y="30">
<parameter key="condition" value="matches"/>
<parameter key="regular_expression" value="research\sand\sdevelopment"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
<connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
fras
For some reason your XML is not valid, but the important line is this:
<parameter key="regular_expression" value="research\sand\sdevelopment"/>
If you search for "Research...", this regex will fail because upper/lower case matters unless you ignore it by applying the regex switch "i" (written (?i) in Java syntax).
It will also fail if there is more than one whitespace character between the words.
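The two failure cases just described can be reproduced in a minimal sketch. It uses Python's re module as a stand-in, on the assumption that the inline switch (?i) and \s+ behave the same as in Java syntax for this plain ASCII input. The sample text is hypothetical.

```python
import re

# Hypothetical input: capital letters and two spaces after "Research".
text = "Research  and Development"

# The pattern from the posted process: case-sensitive, exactly one whitespace.
strict = re.compile(r"research\sand\sdevelopment")

# Same phrase with the "i" switch and one-or-more whitespace.
relaxed = re.compile(r"(?i)research\s+and\s+development")

strict_hit = strict.fullmatch(text) is not None
relaxed_hit = relaxed.fullmatch(text) is not None
```

Here strict_hit is False and relaxed_hit is True. Note that this only tests the expression against one complete string; it says nothing about what the tokens inside the RapidMiner process look like.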
student24
I thought I was ignoring upper/lower case by using the operator "Transform Cases" with the option "lower case". For more than one whitespace I could use \s+, but that doesn't work either.
Why is my XML not valid?
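One possible explanation, offered as a guess and not confirmed anywhere in this thread: the process posted above runs Tokenize before Filter Tokens (by Content). If Tokenize splits the text at every non-letter character (assumed here to be its default mode), no token can contain whitespace, so an expression that requires whitespace inside a token never matches. The sketch below imitates that chain in Python; the three steps are stand-ins and not RapidMiner code, and the sample sentence is made up.

```python
import re

# Hypothetical document text.
document = "Our Research and Development team grew."

# Assumed stand-in for Tokenize: split at every run of non-letter characters.
tokens = [t for t in re.split(r"[^A-Za-z]+", document) if t]

# Stand-in for Transform Cases set to lower case.
tokens = [t.lower() for t in tokens]

# Stand-in for Filter Tokens (by Content) with condition "matches":
# each token is tested on its own against the whole expression.
pattern = re.compile(r"research\s+and\s+development")
kept = [t for t in tokens if pattern.fullmatch(t)]
```

Under these assumptions tokens is a list of six single words and kept is empty, which would be consistent with the empty word list reported earlier, regardless of whether \s or \s+ is used.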
MariusHelf
Corrected XML above.
student24
OK, thank you. I copied it, but the problem with the whitespace isn't solved. I don't know what I'm doing wrong.
Is there maybe another operator or another way to search for expressions in documents?
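One way to approach this outside RapidMiner is to run the expression against the raw document text instead of against individual tokens. The following is a minimal sketch in Python, not a RapidMiner operator, and it leaves open whether the Text Processing extension offers an equivalent. The sample text and the words word2 and word3 are placeholders; \s+ is used because extracted text (for example from PDF files) may contain line breaks or repeated spaces inside a phrase.

```python
import re

# Hypothetical raw text as it might look before any tokenization.
raw_text = "The Research  and\nDevelopment unit uses word2 often."

# A multi-word phrase and single words combined with "|",
# case-insensitive, allowing any run of whitespace inside the phrase.
pattern = re.compile(r"(?i)research\s+and\s+development|word2|word3")

# Collect every match in the order it occurs.
hits = [m.group(0) for m in pattern.finditer(raw_text)]

# Lower-case each hit and collapse internal whitespace to single spaces.
normalised = [" ".join(h.lower().split()) for h in hits]
```

For the sample text, normalised contains "research and development" and "word2"; word3 does not occur and is therefore absent.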