How to extract the text that occurs immediately before a certain type of formatting in HTML?

User: "MRNJEM001"
Updated by Jocelyn

Hi community :)

I'm a beginner. I've tried my best to figure this out, but unfortunately I haven't cracked it.

I have a piece of software that outputs text as HTML, with certain words in red. I need to produce a document containing the word that immediately precedes each red word.

For example, I need to get this word and also that word as output.

Here is an example extract of the software output:

<span class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSansBold",serif;color:#E74E31'>challenges</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>in</span></span>

As you can see, the markup is quite messy: there is a stray &nbsp; span between each pair of words.

In this extract, "challenges" is the flagged (red) word, so I need to output "substantial". A single document contains a few hundred red words.
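
To pin down the expected result, here is a minimal Python sketch of the extraction, using only the standard library and running outside RapidMiner. It is an illustration under assumptions, not a RapidMiner solution: it assumes every red word sits inside an outer span with class "span-red" and that each word has its own span, as in the extract above. The names RedWordParser and words_before_red are made up for this example.

```python
from html.parser import HTMLParser


class RedWordParser(HTMLParser):
    """Collect (text, is_red) pairs for each non-blank text node."""

    def __init__(self, red_class="span-red"):
        # convert_charrefs=True turns &nbsp; into the character U+00A0
        super().__init__(convert_charrefs=True)
        self.red_class = red_class  # assumption: class that marks red words
        self.span_classes = []      # class of every currently open <span>
        self.chunks = []            # (text, is_red) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_classes.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self.span_classes:
            self.span_classes.pop()

    def handle_data(self, data):
        # strip() also removes U+00A0, so the &nbsp; spacer spans vanish
        text = data.strip()
        if text:
            is_red = self.red_class in self.span_classes
            self.chunks.append((text, is_red))


def words_before_red(html):
    """Return the word immediately preceding each red word."""
    parser = RedWordParser()
    parser.feed(html)
    parser.close()

    result = []
    previous = None  # last word seen so far, red or not
    for text, is_red in parser.chunks:
        if is_red and previous is not None:
            result.append(previous)
        previous = text.split()[-1]
    return result


sample = (
    "<span class=span-black><span lang=EN-ZA>substantial</span></span>"
    "<span class=apple-converted-space><span>&nbsp;</span></span>"
    "<span class=span-red><span lang=EN-ZA>challenges</span></span>"
)
print(words_before_red(sample))  # ['substantial']
```

One design choice to be aware of: if two red words follow each other directly, this sketch reports the first red word as the predecessor of the second. A red word at the very start of the document has no predecessor and is skipped.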

Is there any way to accomplish this in RapidMiner? I've tried the Cut Document and Documents to Data operators, as well as the Rosette Text Analytics and Information-Extraction extensions, but I'm quite lost.

Thanks!