Regular Expressions II - Needle I need You.

lists
lists New Altair Community Member
edited November 2024 in Community Q&A

Dear reader, this is a continuation of this thread:

 

http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-expressions/m-p/37750#M26008

 

I opened a new one- because the original thread is solved.

 

I try to grep a word of a text-value to generate a value with the Generate Attribute oparator.

 

replaceAll(haystack, "hay (.*) hay", "$1")

 

...doesn't work. I guess I need needle as result, but it never appears in $1-$3 probably because it's replaced.

Alternative methods like match() and find() only deliver true/false.

 

Is there an example expression and function which collects an arbitrary word of an attribute containing text, to provide a value for a new attribute (without text-extension)?

 

needle.png

 

Needle I need you - Song!

https://www.youtube.com/watch?v=rNS6D4hSQdA

 

 

 

Tagged:

Best Answer

  • Edin_Klapic
    Edin_Klapic New Altair Community Member
    Answer ✓

    @BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer

     

    Finally:

     

    Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="haystack" value="&quot;SELECT * FROM&#13;&#10;myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
    <parameter key="macro" value="searchword"/>
    <parameter key="value" value="needle"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
    <parameter key="macro" value="regexTablename"/>
    <parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
    <parameter key="macro" value="regexLimit"/>
    <parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
    <list key="function_descriptions">
    <parameter key="searchword extraction" value="if(matches(haystack,&quot;(?di).*&quot;+%{searchword}+&quot;.*&quot;),replaceAll(haystack,&quot;(?di).*(&quot;+%{searchword}+&quot;).*&quot;,&quot;$1&quot;),haystack)"/>
    <parameter key="test if searchword matches" value="matches(haystack,&quot;(?di).*needle.*&quot;)"/>
    <parameter key="tablename" value="replaceAll(haystack,%{regexTablename},&quot;$1&quot;)"/>
    <parameter key="limit" value="replaceAll(haystack,%{regexLimit},&quot;$1&quot;)"/>
    </list>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
    <connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
    <connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
    <connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    image.png

    As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.

     

    Resulting ExampleSet:

    image.png

     

    The string which is tested (i.e. the content of the attribute haystack) is

    SELECT * FROM
    myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

     

    The regex for the tablename is

    (?di).*from\s+(\S+)\s+order.*

     

    The regex for the limit is

    (?di).*limit\s+(\S+)\s+offset.*

     

    Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.

     

    Afterwards you can use Extract Macro for getting a variable with the desired value.

     

    Hope this works for you :)

     

    Best regards,

    Edin

     

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Oh no, you started a new thread and didn't tag anyone. They probably won't see this right away. 

  • lists
    lists New Altair Community Member

    Flexibility Requirement:

     

    Let's take another source text (real live example).

     source:

    SELECT * FROM
    myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

     

    Now I want to retrieve myNeedleTable and later 30.

     regex:

    (?mi)(from)(.*)(order)

     

    ..shows a strange Result preview, though Result List shows the right Group Matches.

    That's why it replaces the matching string. If I invert the selection it does not match myNeedleTable.

     

    I think I have to turn 

    (?mi)(from)(.*)(order)

    ...into its opposite.

     

    (?!(from)(.*)(order))

    ...does not work.

  • Edin_Klapic
    Edin_Klapic New Altair Community Member

    Tadaa :)

     

    image.png

     

    Or separately

    (?mi).*from\s?(\S*).*
    (?is).*limit\s?(\S*).*

     

  • lists
    lists New Altair Community Member

     

    You are a Regex Guru.

     

    How often do you do regex- every day?

     

    (?mius)((?!(from)(.*)(order)).) left me one "F" to much.

     

    I will choose your solution. Thank you!

  • lists
    lists New Altair Community Member

     

    I get other results then you...strange...

    strange.png

     

    Something funny:

     

    SELECT *  
    FROM My_other_table order by drDateDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM My_other_table)-33

    strange2.png

     

    I think since RM seems to have no real matching function native (without extension),

    I should consider if Regex is the right approach for my pipieline. I would need it as much flexible as possible. 

    Performance would also be a considerable point. The detour via replace seems a little bit 'overkill' too.

     

    One of the functions matches() or find() should return needle. That would be the real smart deal.  

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member

    In your first picture, you use \s?. This means "0 or 1 whitespace characters". The line feed is 1 or even 2 characters depending on your operating system.

     

    I always use \s+ so I'm independent of the number of whitespace characters. If it is syntactically possible to leave out the whitespace, you could also use \s*. 

     

    Regards,

     

    Balázs

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Just an FYI, the Generate Attributes operator does have contains() and matches() functions. You can also you use RegEx in there and create new columns (attributes).

  • lists
    lists New Altair Community Member

    Thank you, Do you have an example where contains() or matches() returns needle (the searched word)?

    My search term isn't static so "(.*)".

  • Edin_Klapic
    Edin_Klapic New Altair Community Member
    Answer ✓

    @BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer

     

    Finally:

     

    Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="haystack" value="&quot;SELECT * FROM&#13;&#10;myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
    <parameter key="macro" value="searchword"/>
    <parameter key="value" value="needle"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
    <parameter key="macro" value="regexTablename"/>
    <parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
    <parameter key="macro" value="regexLimit"/>
    <parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
    <list key="function_descriptions">
    <parameter key="searchword extraction" value="if(matches(haystack,&quot;(?di).*&quot;+%{searchword}+&quot;.*&quot;),replaceAll(haystack,&quot;(?di).*(&quot;+%{searchword}+&quot;).*&quot;,&quot;$1&quot;),haystack)"/>
    <parameter key="test if searchword matches" value="matches(haystack,&quot;(?di).*needle.*&quot;)"/>
    <parameter key="tablename" value="replaceAll(haystack,%{regexTablename},&quot;$1&quot;)"/>
    <parameter key="limit" value="replaceAll(haystack,%{regexLimit},&quot;$1&quot;)"/>
    </list>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
    <connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
    <connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
    <connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    image.png

    As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.

     

    Resulting ExampleSet:

    image.png

     

    The string which is tested (i.e. the content of the attribute haystack) is

    SELECT * FROM
    myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

     

    The regex for the tablename is

    (?di).*from\s+(\S+)\s+order.*

     

    The regex for the limit is

    (?di).*limit\s+(\S+)\s+offset.*

     

    Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.

     

    Afterwards you can use Extract Macro for getting a variable with the desired value.

     

    Hope this works for you :)

     

    Best regards,

    Edin

     

  • lists
    lists New Altair Community Member

    @Edin_Klapic 

     

    Wow thank you. That probably was a lot of work.

     

    As I said before, I changed my strategy/pipeline.

    I found initial attributing/tagging important stuff (with 'talking declarations') while data preparation, makes much more sense in my case, then extracting informations via replace() to generate attrbutes later in the pipeline. My data source is now "normalized" and optimized.

     

    I'm still not convinced that replacing should substitute a generic matching in most cases. 

     

    But I'm sure there will be other cases, when this example could be helpful. 

    So a great contribution to the community.