RegEx query returns only one word instead of a complete sentence
TobiTee
New Altair Community Member
Hey, I am new to RapidMiner and am trying to analyze text for my Bachelor thesis. I have already pre-processed the documents (tokenized them, etc.) and would like to use "Extract Information" with regular expressions to get all sentences containing the word "Kenntnisse".
I have already tested some expressions on regex101.com and regexr.com, and they all worked.
Examples: ^.*(kenntnisse|Kenntnisse|kennt*) or (?m)^.*?(Kenntnisse).*$
But as soon as I use the query in "Extract Information", I only get the word "Kenntnisse", not the whole sentence or paragraph.
Can anyone help me?
Thanks guys!
Answers
-
You put the capturing group () around kenntnisse only, so it's normal that this is the only thing returned: the text before and after it is not part of the group. If you want the whole sentence, put the opening ( at the beginning of the expression and the closing ) at the end.
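To make the difference visible, here is a small sketch in Python. This is an assumption for illustration only: it uses Python's re module rather than RapidMiner's own regex engine, and the sample text is made up. The group behaviour shown is the same idea, though: you get back whatever sits inside the parentheses.

```python
import re

# Made-up sample text, one sentence per line (assumption for illustration).
text = (
    "Wir suchen Verstärkung.\n"
    "Gute Kenntnisse in Java sind erforderlich.\n"
    "Bewerbung per Mail."
)

# Group around the word only: group 1 holds just the word.
narrow = re.search(r"(?m)^.*?(Kenntnisse).*$", text)

# Group around the whole line: group 1 holds the full sentence.
wide = re.search(r"(?m)^(.*?Kenntnisse.*)$", text)

print(narrow.group(1))  # Kenntnisse
print(wide.group(1))    # Gute Kenntnisse in Java sind erforderlich.
```

Both expressions match the same line; only the position of the parentheses decides what ends up in group 1.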
-
Oh man... thank you for that! That's almost a bit embarrassing...
Before I ask more questions, I'll try a few more things and read further into the topic.
But I will be back.
-
Hi again,
So it looks like this query is working: (?i)[^.\s]*Kenntnisse*[^\n]*
Only one match (the first) is displayed in the results, whereas the editor shows 4 matches.
Is there something else I have forgotten? I thought this was achieved by "multiline mode", but it seems to make no difference.
-
Could you share some of your process in demo format? It's a bit hard to see where the issue is without full visibility.
The [^.\s]* part basically means 'any run of characters that are not literal dots or whitespace', so it only picks up characters attached directly to the front of the word, not the rest of the sentence. It may not give you the results you need, and it is probably the reason you only get the first match and multiline isn't working.
Maybe try something like this:
(?i)^.*\bkenntnisse\b.*$
\b is a word boundary: a zero-width position between a word character and a non-word character. So the expression above roughly states 'match the line if the word kenntnisse occurs anywhere between its start (^) and end ($), whatever the casing'.
Multiline mode (the m flag, e.g. (?mi) at the start) makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole text, so the expression is applied line by line. Since . does not match line breaks by default, you could probably even drop ^ and $, but they never hurt...
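As a rough check of the word boundary and multiline behaviour described above, here is a sketch in Python's re module. That is an assumption: RapidMiner's engine may behave differently in details, and the sample lines are invented. It also shows a side effect worth knowing for German text: \bkenntnisse\b does not match inside compounds such as "Sprachkenntnisse".

```python
import re

# Invented sample, one sentence per line (assumption for illustration).
text = (
    "Gute Kenntnisse in SQL erwünscht.\n"
    "Teamfähigkeit ist wichtig.\n"
    "Sprachkenntnisse: Deutsch und Englisch.\n"
    "KENNTNISSE im Projektmanagement von Vorteil."
)

# m: ^ and $ match at every line start/end; i: ignore case.
with_multiline = re.findall(r"(?mi)^.*\bkenntnisse\b.*$", text)

# Same expression without m: $ only matches at the very end of the text,
# and . cannot cross line breaks, so nothing matches on multi-line input.
without_multiline = re.findall(r"(?i)^.*\bkenntnisse\b.*$", text)

print(with_multiline)
# ['Gute Kenntnisse in SQL erwünscht.', 'KENNTNISSE im Projektmanagement von Vorteil.']
print(without_multiline)  # []
```

If compounds like "Sprachkenntnisse" should count as well, dropping the leading \b would be one option.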
If you want to match any of several words, you can use the following:
(?i)^.*\b(?:kenntnisse|other_word|something_else)\b.*$
The (?: xxx ) construct is a non-capturing group: it lets you group the alternatives without 'storing' the match.
-
How do I share part of my process in demo format? Just export the process, or copy and paste the XML code?
-
This isn't necessarily great news, but I had a similar problem with returning multiple lines with a regex search string a while back. In fact @kayman helped me out a bit with the syntax for it then too! The sad thing is that I was never able to get it to work properly in RapidMiner, although it worked just fine in several other regex environments. So I suspect there is some deficiency or variation in their regex implementation related to line break characters that keeps multiline mode or multiple sentence matches from working properly. I ended up solving the problem by doing the required matching bit in Python instead. Annoying when there should be a native RapidMiner solution, but at least that can still be done inside a larger RapidMiner process if desired.
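For anyone going the Python route, a minimal sketch of what such a matching step could look like follows. This is not the script I actually used back then. The function name, the naive sentence splitting, and the sample text are all assumptions for illustration, and wiring it into a RapidMiner process is not shown.

```python
import re


def sentences_containing(text, word):
    """Return every sentence of `text` that contains `word` (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = re.compile(re.escape(word), re.IGNORECASE)
    return [s for s in sentences if needle.search(s)]


# Invented sample text (assumption for illustration).
sample = (
    "Wir bieten ein tolles Team. Gute Kenntnisse in Java sind Pflicht! "
    "Bewerbung per Mail.\nEnglischkenntnisse sind von Vorteil."
)
hits = sentences_containing(sample, "kenntnisse")
print(hits)
# ['Gute Kenntnisse in Java sind Pflicht!', 'Englischkenntnisse sind von Vorteil.']
```

Note that this does a plain substring match, so compounds like "Englischkenntnisse" are included. The simple split will also trip over abbreviations such as "z.B.", so real job ads may need a proper sentence tokenizer.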
-
Yeah, regex has many dialects so it can be a bit annoying sometimes to get the right syntax at the right location...
@TobiTee, thanks for the process already. If you do not mind, could you also send me the Excel file (you can use my PM for that or just add it as an attachment here)? Then I can reconstruct the whole flow.
-
@Telcontar120 Oh, OK, thanks! That explains a lot... I tested the syntax in 2 other environments, and even in the editor in RapidMiner everything seemed to work.
But after running the process I only received the first match.
I hope @kayman has some magical tips. Otherwise there is still Python...
-
Well, magical is maybe a bit overrated but I did figure out the issue.
What you are trying to do is extract multiple sentences in one go, and that isn't really supported. The operator correctly returns what the regex selects, but it has no real clue what to do with the parts that don't match, so it just keeps them as they are. That is technically correct but may look strange. The way the process is constructed now, the operator just sees 'Ah, I have this in my content, so I allow the full thing'.
You could use this to get the first match, the last match, or an in-between match, but you cannot say 'I want sentences 1 and 5', as the operator cannot do that. The regex tester is a generic one, so its replace option misleads us here: the operator has no replace, it only matches.
A workaround would be to tokenise by sentence first and then do the extract, but that's pretty heavy. A better way around is to invert the logic: rather than keeping what you need, you remove what you don't need.
You can use negative lookahead for that, so something like
(?mi)^(?!.*kenntnisse).*$
and replace the match with nothing. This works with the data operators; the document operators do not support replacing with nothing, so it's a bit more complex there.
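To show the remove-what-you-don't-need idea outside RapidMiner, here is a sketch in Python's re module. That is an assumption: it is not the attached process, and the sample text is invented. Every line without the word is replaced by an empty string, and the leftover empty lines are then dropped.

```python
import re

# Invented sample, one sentence per line (assumption for illustration).
text = (
    "Wir suchen Verstärkung.\n"
    "Gute Kenntnisse in Java sind erforderlich.\n"
    "Bewerbung per Mail.\n"
    "SQL-Kenntnisse von Vorteil."
)

# The negative lookahead (?!.*kenntnisse) only lets a line match when the
# word is absent, so exactly the unwanted lines are replaced with nothing.
cleaned = re.sub(r"(?mi)^(?!.*kenntnisse).*$", "", text)

# The line breaks of the removed lines are still there; drop the empty lines.
kept = [line for line in cleaned.splitlines() if line]

print(kept)
# ['Gute Kenntnisse in Java sind erforderlich.', 'SQL-Kenntnisse von Vorteil.']
```

The replace only empties the lines and leaves their line breaks in place, so a second clean-up step for blank lines is usually needed.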
I've simplified your process a bit using this logic, so actually using a replace instead of extract, and at first glance it seems to work also. I'll attach this if the browser allows me, hope this gets you further. You can just import the attached rmp process.
(BTW, maybe it's best to remove your XML again, seems browsers have a hard time dealing with it once they get a certain size...)