RegEx query returns only one word instead of a complete sentence

User: "TobiTee"
New Altair Community Member
Updated by Jocelyn
Hey, I am new to RapidMiner and am trying to analyze text for my Bachelor thesis. I have already pre-processed the documents (tokenized them, etc.) and would like to use "Extract Information" with regular expressions to get all sentences containing the word "Kenntnisse".
I have already tested some expressions on regex101.com and regexr.com, and they all worked.
Examples: ^.*(kenntnisse|Kenntnisse|kennt*) or (?m)^.*?(Kenntnisse).*$  But as soon as I use the query in "Extract Information", I only get the word "Kenntnisse", not the whole sentence or paragraph.

Can anyone help me?

Thanks guys!


    User: "kayman"
    New Altair Community Member
    Accepted Answer
    You put the capturing group () only around Kenntnisse, so it's normal that this is the only thing returned: nothing before or after it is selected. If you want the whole sentence, put the opening ( at the beginning of the expression and the closing ) at the end.
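
To illustrate the group placement outside RapidMiner, here is a minimal sketch in Python. It assumes Python's `re` module behaves like RapidMiner's regex engine for these simple patterns, which may not hold in every detail, and the sample text and variable names are made up for the example.

```python
import re

# Hypothetical sample text, one sentence per line.
text = (
    "Gute Kenntnisse in Java sind erforderlich.\n"
    "Teamfähigkeit wird erwartet.\n"
    "Kenntnisse in SQL sind von Vorteil."
)

# Group only around the keyword: group 1 holds just the keyword.
keyword_only = re.compile(r"(?m)^.*?(Kenntnisse).*$")

# Group around the whole expression: group 1 holds the full line.
whole_line = re.compile(r"(?m)^(.*?Kenntnisse.*)$")

keyword_hits = [m.group(1) for m in keyword_only.finditer(text)]
line_hits = [m.group(1) for m in whole_line.finditer(text)]

print(keyword_hits)
print(line_hits)
```

The first list contains only the word itself for every matching line, while the second contains the complete lines, which is the difference described above.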
    User: "kayman"
    New Altair Community Member
    Accepted Answer
    Well, magical is maybe a bit overrated, but I did figure out the issue.
    What you are trying to do is extract multiple sentences in one go, and that isn't exactly supported. The operator correctly returns what the regex selects, but it has no real clue what to do with the part that doesn't match, so it just keeps it as is, which is technically correct but may look strange. The way it is constructed now, the operator just sees 'Ah, I have this in my content, so I allow the full thing'.

    You could use this to get the first match, the last match, or the matches in between, but you cannot use it to say 'I want sentences 1 and 5', as the operator cannot do that. The regex emulator is a generic one, so its replace feature tricks us here: the operator has no replace, just match...

    A workaround would be to tokenise by sentence first and then do the extract, but that's pretty heavy, so a better way is to use negative lookaheads. So rather than keeping what you need, you basically remove what you don't need.
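
For comparison, the heavier tokenise-first workaround could look roughly like this in Python. This is only a sketch: the naive sentence splitter, the sample text, and the variable names are assumptions for illustration and are not how RapidMiner's tokenise operator works internally.

```python
import re

# Hypothetical sample text with three sentences.
text = (
    "Wir erwarten gute Kenntnisse in Java. Teamfähigkeit ist wichtig. "
    "Erste kenntnisse in SQL sind von Vorteil!"
)

# Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Keep only the sentences that mention the keyword, ignoring case.
kept = [s for s in sentences if re.search(r"kenntnisse", s, flags=re.IGNORECASE)]

for sentence in kept:
    print(sentence)
```

Every sentence becomes its own item before filtering, which is why this route gets expensive on large documents.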

    To remove the lines you don't need, you can use a negative lookahead, so something like

    (?mi)^(?!.*kenntnisse).*$

    and replace it with nothing. This works with the data operators; the document operators do not support replacing with nothing, so it's a bit more complex there.
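
As a quick check of the negative lookahead outside RapidMiner, here is a small Python sketch. It assumes Python's `re` treats the `(?mi)` flags and the lookahead the same way for this pattern; the sample text and variable names are invented for the example.

```python
import re

# Hypothetical sample text, one sentence per line.
text = (
    "Gute Kenntnisse in Java sind erforderlich.\n"
    "Teamfähigkeit wird erwartet.\n"
    "kenntnisse in SQL sind von Vorteil."
)

# Matches every line that does NOT contain 'kenntnisse' (case-insensitive).
drop_pattern = re.compile(r"(?mi)^(?!.*kenntnisse).*$")

# Replace the unwanted lines with nothing; their line breaks remain.
blanked = drop_pattern.sub("", text)

# Drop the empty lines left behind by the replacement.
remaining = [line for line in blanked.splitlines() if line]

print(remaining)
```

Note that the replacement leaves an empty line where the removed sentence was, so some cleanup of blank lines may still be needed afterwards.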

    I've simplified your process a bit using this logic, so it actually uses a replace instead of an extract, and at first glance it seems to work. I'll attach it if the browser allows me; I hope this gets you further. You can just import the attached rmp process.
    (BTW, it may be best to remove your XML again; browsers seem to have a hard time dealing with it once it reaches a certain size...)