How to extract the word that occurs immediately before a certain type of formatting in HTML?
Hi community
Beginner here. I've tried my best to figure this out but unfortunately haven't cracked the case.
I have a piece of software that outputs text as HTML, with certain words in red. I need to produce a document containing the word that immediately precedes each red word.
For example, I need to get this word and also that word as output.
Here is an example extract of the software output:
<span class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'> </span></span><span
class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSansBold",serif;color:#E74E31'>challenges</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'> </span></span><span
class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>in</span></span>
As you can see, it is quite messy (there is a separate apple-converted-space span containing a single space in between each word).
In this extract, "challenges" is the flagged (red) word and I need to output "substantial". A typical document contains a few hundred red words.
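To make the expected behaviour concrete, here is a rough sketch of the logic I'm after, written in Python with only the standard library rather than in RapidMiner. It assumes that red words always sit inside a span with class=span-red (as in the extract above) and that the spacer spans contain only whitespace. The names RedWordParser and words_before_red are made up for illustration.

```python
from html.parser import HTMLParser


class RedWordParser(HTMLParser):
    """Collect (css_class, word) pairs for every word in the document."""

    def __init__(self):
        super().__init__()
        self.class_stack = []  # class attribute of each open <span> (None if absent)
        self.tokens = []       # (nearest enclosing class, word), in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.class_stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self.class_stack:
            self.class_stack.pop()

    def handle_data(self, data):
        # Use the class of the nearest enclosing span that has one.
        cls = next((c for c in reversed(self.class_stack) if c), None)
        # split() drops whitespace-only runs, so the spacer spans vanish here.
        for word in data.split():
            self.tokens.append((cls, word))


def words_before_red(html, red_class="span-red"):
    """Return the word directly before each red word (or run of red words)."""
    parser = RedWordParser()
    parser.feed(html)
    parser.close()
    tokens = parser.tokens
    # Assumption: if several red words are adjacent, only the word before
    # the first of them is reported.
    return [
        tokens[i - 1][1]
        for i in range(1, len(tokens))
        if tokens[i][0] == red_class and tokens[i - 1][0] != red_class
    ]


# Shortened version of the extract above, same nesting and unquoted attributes.
sample = """<span class=span-black><span lang=EN-ZA style='color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA> </span></span><span
class=span-red><span lang=EN-ZA style='color:#E74E31'>challenges</span></span>"""

print(words_before_red(sample))  # ['substantial']
```

For the shortened sample this prints ['substantial'], which is the kind of output I would like to collect for a whole document.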
Is there any way I can accomplish this in RapidMiner? I've tried using Cut Document and Documents to Data, as well as the Rosette Text Analytics and Information-Extraction extensions, but I'm quite lost.
Thanks!