whitespace regular expression filter tokens by content

student24

Hello everybody,

I want to search documents for certain words. I use the Filter Tokens (by Content) operator with a regular expression. To search for more than one word I use word1|word2|...|wordn. My question: how can I search for an expression that contains whitespace, for example "Research and Development|Word2|Word3"? Is there a wildcard for whitespace?

Thanks for your help


RalfKlinkenberg

You can use

\s as a placeholder for a single whitespace character,
\s+ for one or more whitespace characters, and
\s* for zero or more whitespace characters.
\t is a placeholder for a tab character.

RapidMiner regular expressions use the Java syntax for regular expressions. If you search for "Java regular expressions" with Google or another search engine, you will find a lot of documentation.

Best wishes,
Ralf
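The placeholders listed above can be tried outside RapidMiner with plain java.util.regex. The following is only a minimal sketch: the class name WhitespaceRegexDemo, the helper found, and the sample strings are made up for illustration. Note that a Java string literal needs the backslash doubled ("\\s"), whereas a regular expression typed into a parameter field is written with a single backslash (\s).

```java
import java.util.regex.Pattern;

class WhitespaceRegexDemo {
    // Returns true if the regex matches anywhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String single = "research and development";
        String doubled = "research  and development"; // two spaces after "research"
        String tabbed = "research\tand development";  // tab after "research"

        // \s stands for exactly one whitespace character (space, tab, line break, ...).
        System.out.println(found("research\\sand\\sdevelopment", single));  // true
        System.out.println(found("research\\sand\\sdevelopment", doubled)); // false
        System.out.println(found("research\\sand\\sdevelopment", tabbed));  // true

        // \s+ tolerates runs of whitespace.
        System.out.println(found("research\\s+and\\s+development", doubled)); // true

        // \s* also accepts no whitespace at all.
        System.out.println(found("research\\s*and", "researchand")); // true

        // \t matches a tab only, not a space.
        System.out.println(found("research\\tand", tabbed)); // true
        System.out.println(found("research\\tand", single)); // false

        // Alternatives that contain whitespace can be combined with |.
        System.out.println(found("research\\sand\\sdevelopment|word2", "word2")); // true
    }
}
```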

student24

Thank you very much for your reply.

I have tried these before, but it doesn't work. The word list contains no results although the expression is in the document. I don't know what I'm doing wrong. Do you know whether it works when I'm examining PDF files?

Thanks

RalfKlinkenberg

If you post the XML code of your RapidMiner process here, there is a chance that someone in the forum may be able to help.

Without being able to see the RapidMiner process, we can only guess where the problem in your RapidMiner process might be.

student24

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
        <list key="text_directories">
          <parameter key="test" value="C:"/>
        </list>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="380" y="30"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="514" y="30">
            <parameter key="condition" value="matches"/>
            <parameter key="regular_expression" value="research\sand\sdevelopment"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

fras

For some reason your XML is not valid, but the important line is this:
<parameter key="regular_expression" value="research\sand\sdevelopment"/>
If the text contains "Research..." this regex will fail, because upper/lower case
matters unless you switch it off with the regex flag "i".
It will also fail if there is more than one whitespace character between the words.
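Both failure modes mentioned here can be reproduced in plain Java. The sketch below is illustrative only: the class name CaseInsensitiveDemo and the sample text are made up. In Java regex syntax the "i" switch is written inline as (?i) at the start of the expression; it corresponds to the Pattern.CASE_INSENSITIVE flag.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Returns true only if the regex matches the entire text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String text = "Research and   Development"; // capital letters, three spaces

        // Fails twice over: wrong case and more than one space.
        System.out.println(matchesWhole("research\\sand\\sdevelopment", text)); // false

        // \s+ fixes the spacing, but the case still differs.
        System.out.println(matchesWhole("research\\s+and\\s+development", text)); // false

        // (?i) switches on case-insensitive matching for the whole expression.
        System.out.println(matchesWhole("(?i)research\\s+and\\s+development", text)); // true

        // The same effect using the compile-time flag instead of the inline switch.
        Pattern p = Pattern.compile("research\\s+and\\s+development", Pattern.CASE_INSENSITIVE);
        System.out.println(p.matcher(text).matches()); // true
    }
}
```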

student24

I thought I was ignoring upper/lower case by using the "Transform Cases" operator with the option "lower case". For more than one whitespace character I could use \s+, but that doesn't work either.
Why is my XML not valid?

MariusHelf

Corrected XML above.

student24

OK, thank you. I copied it, but the problem with the whitespace isn't solved. I don't know what I'm doing wrong.
Is there maybe another operator or another way to search for an expression in documents?
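One possible explanation, which nobody in the thread confirms: in the posted process the text passes through Tokenize before it reaches Filter Tokens (by Content). Assuming Tokenize splits on non-letter characters and the filter compares each token separately against the expression, no single token can ever contain whitespace, so an expression with \s has nothing to match. The sketch below imitates that behaviour in plain Java. It is not RapidMiner code, and the class TokenFilterSketch and its methods are hypothetical stand-ins for the operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class TokenFilterSketch {
    // Hypothetical stand-in for Tokenize followed by Transform Cases:
    // split on runs of non-letter characters, then lower-case each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a "matches" filter:
    // keep a token only if the regex matches the whole token.
    static List<String> filter(List<String> tokens, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (p.matcher(t).matches()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Our Research and Development team.");
        System.out.println(tokens); // [our, research, and, development, team]

        // The phrase is spread over three tokens, so no token matches it.
        System.out.println(filter(tokens, "research\\s+and\\s+development")); // []

        // Single-word alternatives still work, because each fits in one token.
        System.out.println(filter(tokens, "research|development")); // [research, development]
    }
}
```

If this assumption holds, the phrase has to be matched before the text is split into single words, or the tokens have to be recombined into multi-word terms before filtering.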