RegEx query returns only one word instead of a complete sentence

TobiTee
TobiTee New Altair Community Member
edited November 5 in Community Q&A
Hey, I am new to RapidMiner and am trying to analyze text for my Bachelor thesis. I have already pre-processed the documents (tokenized them, etc.) and would like to use "extract information" with regular expressions to get all sentences containing the word "Kenntnisse".
I have already tested some expressions on regex101.com and regexr.com, and they all worked.
Examples: ^.*(kenntnisse|Kenntnisse|kennt*) or (?m)^.*?(Kenntnisse).*$  But as soon as I use the query in "extract information", I only get the word "Kenntnisse", not the whole sentence / paragraph.

Can anyone help me?

Thanks guys!


Answers

  • TobiTee
    TobiTee New Altair Community Member
    Oh man... thank you for that!
    That's almost a bit embarrassing...

    Before I ask more questions, I will try a few more things and read further into the topic.

    But I will be back B)

  • TobiTee
    TobiTee New Altair Community Member
    Hi again,
    so it looks like this query is working: (?i)[^.\s]*Kenntnisse*[^\n]*
    However, only one match (the first) is displayed in the results, whereas the editor shows 4 matches.
    Is there something else I have forgotten? I thought this was achieved by the "multiline mode", but it seems to make no difference.



  • kayman
    kayman New Altair Community Member
    Could you share some of your process in demo format? It's a bit hard to see where the issue is without full visibility.
    The [^.\s] part basically means 'any character except a literal dot or whitespace', so it may not give you the results you need. That is probably the reason you only get the first match and multiline isn't working.
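
    To illustrate what the [^.\s]* prefix does, here is a minimal sketch using Python's re module. The sample text is made up, and Python's engine may not behave exactly like the one RapidMiner uses, so treat it as an illustration only. It suggests that the match starts at the token containing the keyword, not at the start of the line.

```python
import re

# Made-up sample text (an assumption, not taken from the original documents).
text = "Gute Kenntnisse in Java.\nFachkenntnisse erforderlich."

# The query from the post above, unchanged.
pattern = r"(?i)[^.\s]*Kenntnisse*[^\n]*"

# [^.\s]* cannot cross a space, so the words before the keyword are dropped.
matches = re.findall(pattern, text)
for m in matches:
    print(repr(m))
```

    In this sketch "Gute" is lost from the first line because the character class stops at the space in front of the keyword.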

    Maybe try something like this:

    (?i)^.*\bkenntnisse\b.*$

    The \b means word boundary, i.e. a zero-width position between a word character and a non-word character. So the pattern above roughly states: 'match if the word kenntnisse occurs anywhere between start (^) and end ($), whatever the casing'.

    Multiline mode makes ^ and $ match at the start and end of every line instead of the whole text, so the pattern is applied line by line. Since . normally does not match line breaks, you probably wouldn't even need the ^ and $ characters, but they never hurt...
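
    As a rough check of the multiline behaviour, here is a small sketch in Python's re module. The sample text is invented, and RapidMiner's regex engine may differ in details, so this only shows how the flags behave in Python.

```python
import re

# Invented three-line sample; only lines 1 and 3 contain the keyword.
text = "Gute Kenntnisse in Java\nTeamarbeit\nSQL-Kenntnisse von Vorteil"

anchored = r"^.*\bkenntnisse\b.*$"
unanchored = r".*\bkenntnisse\b.*"

# Without multiline mode, ^ and $ refer to the whole text, so nothing matches.
without_multiline = re.findall(anchored, text, flags=re.IGNORECASE)

# With multiline mode, ^ and $ match at every line start and end.
with_multiline = re.findall(anchored, text, flags=re.IGNORECASE | re.MULTILINE)

# Without anchors, .* still stops at line breaks, so whole lines are returned.
no_anchors = re.findall(unanchored, text, flags=re.IGNORECASE)

print(without_multiline)
print(with_multiline)
print(no_anchors)
```

    In Python the anchored pattern only finds the lines once multiline mode is switched on, while the unanchored variant returns the same two lines without it.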

    If you want to match any of several words, you can use the following:

    (?i)^.*\b(?:kenntnisse|other_word|something_else)\b.*$

    The (?: xxx ) syntax is a non-capturing group: it lets you group alternatives without 'storing' the matched text as a capture group.
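
    The difference between a capturing and a non-capturing group can be sketched in Python as below. The sample line and the second keyword are made up. Whether RapidMiner returns the first capture group in the same way is an assumption I have not verified, but it would fit the symptom of getting only the single word back.

```python
import re

# Made-up sample line.
line = "Sehr gute Kenntnisse in Python erforderlich"

# Capturing group: re.findall returns only the text inside ( ... ).
capturing = re.findall(r"(?i)^.*\b(kenntnisse|erfahrung)\b.*$", line)

# Non-capturing group: re.findall returns the whole match.
non_capturing = re.findall(r"(?i)^.*\b(?:kenntnisse|erfahrung)\b.*$", line)

print(capturing)
print(non_capturing)
```

    With the capturing group, Python hands back just the keyword. With (?: ... ) it hands back the full line.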
  • TobiTee
    TobiTee New Altair Community Member
    How do I share some of my process in demo format?
    Just by exporting the process, or by copying & pasting the XML code?
  • TobiTee
    TobiTee New Altair Community Member
    edited November 2020
    404
  • Telcontar120
    Telcontar120 New Altair Community Member
    This isn't necessarily great news, but I had a similar problem with returning multiple lines with a regex search string a while back. In fact @kayman helped me out a bit with the syntax for it then too! But the sad thing is that I was never able to get it to work properly in RapidMiner, although it worked just fine in several other regex environments, so I suspect that there is some deficiency or variation in their implementation of regex related to line break characters that causes multiline mode or multiple sentence matches not to work properly. I ended up solving the problem by doing the required matching bit in Python instead. Annoying when there should be a native RapidMiner solution, but at least that can still be done inside a larger RapidMiner process if desired.
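
    For reference, the Python fallback could look roughly like the sketch below. This is not the code actually used back then: the function name lines_containing and the sample text are hypothetical, and wiring it into a RapidMiner process is left out.

```python
import re
from typing import Iterable, List


def lines_containing(text: str, words: Iterable[str]) -> List[str]:
    """Return every line of text that contains one of the words (hypothetical helper)."""
    # Escape the words so they are treated as literals, then join as alternatives.
    alternation = "|".join(re.escape(w) for w in words)
    pattern = re.compile(
        rf"^.*\b(?:{alternation})\b.*$",
        re.IGNORECASE | re.MULTILINE,
    )
    # finditer yields all matches, not just the first one.
    return [m.group(0).strip() for m in pattern.finditer(text)]


sample = "Gute Kenntnisse in Java.\nTeamarbeit ist wichtig.\nErste Erfahrung mit SQL."
result = lines_containing(sample, ["kenntnisse", "erfahrung"])
print(result)
```

    Using finditer (or findall) rather than a single search is what returns all matching lines instead of only the first.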
  • kayman
    kayman New Altair Community Member
    Yeah, regex has many dialects, so it can be a bit annoying sometimes to get the right syntax in the right place...
    @TobiTee, thanks for the process already. If you do not mind, could you also send me the Excel file (you can PM me or just add it as an attachment here)? Then I can reconstruct the whole flow.

  • TobiTee
    TobiTee New Altair Community Member
    @Telcontar120 Oh OK, thanks! That explains a lot...
    I tested the syntax in 2 other environments, and even in the editor in RapidMiner everything seemed to work.
    But after running the process I only received the first match.

    I hope @kayman has some magical tips. Otherwise there is still Python...