How to extract a piece of text occurring before a certain type of formatting in html?

New Altair Community Member

May 15, 2018

Updated Nov 5, 2024 by Jocelyn

Hi community

Beginner here, I've tried my best to figure it out but unfortunately haven't cracked the case.

I have a piece of software that outputs text as HTML with certain words in red. I need to produce a document containing the word that immediately precedes each red word.

For example, I need to get this word and also that word as output.

Here is an example extract of the software output:

<span class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSansBold",serif;color:#E74E31'>challenges</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>in</span></span>

As you can see, it is quite messy (a stray &nbsp; between each word).

In this extract, challenges is the flagged word and I need to output substantial. In a document there are a few hundred red words.
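To make the logic concrete, here is a minimal Python sketch of what I'm after (this is outside RapidMiner, just to illustrate the idea). It assumes red words always sit inside a span with class=span-red, as in the extract above, and that "preceding word" means the last whitespace-separated word before a run of red words. The names RedWordParser and words_before_red are made up for illustration, and the sample string is a shortened version of the extract.

```python
from html.parser import HTMLParser


class RedWordParser(HTMLParser):
    """Collect (css_class, word) pairs for the text found inside spans."""

    def __init__(self):
        super().__init__()      # convert_charrefs=True turns &nbsp; into "\xa0"
        self.class_stack = []   # class attribute of each open <span> (None if absent)
        self.tokens = []        # (class, word) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.class_stack.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self.class_stack:
            self.class_stack.pop()

    def handle_data(self, data):
        # Use the nearest enclosing span that actually carries a class.
        cls = next((c for c in reversed(self.class_stack) if c), None)
        # str.split() also splits on "\xa0", so the &nbsp; spacers vanish here.
        for word in data.split():
            self.tokens.append((cls, word))


def words_before_red(html, red_class="span-red"):
    """Return the word preceding each run of red words (assumed definition)."""
    parser = RedWordParser()
    parser.feed(html)
    parser.close()
    result = []
    previous = None  # last (class, word) seen
    for cls, word in parser.tokens:
        # Record the previous word only when a red run starts after a non-red word.
        if cls == red_class and previous is not None and previous[0] != red_class:
            result.append(previous[1])
        previous = (cls, word)
    return result


# Shortened stand-in for the software output shown above.
sample = (
    "<span class=span-black><span lang=EN-ZA style='color:#4E4E4A'>substantial</span></span>"
    "<span class=apple-converted-space><span lang=EN-ZA style='color:#4E4E4A'>&nbsp;</span></span>"
    "<span class=span-red><span lang=EN-ZA style='color:#E74E31'>challenges</span></span>"
    "<span class=span-black><span lang=EN-ZA>in</span></span>"
)

print(words_before_red(sample))  # ['substantial']
```

If a red word is the very first word in the document, this sketch simply skips it, since there is nothing before it to output.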

Is there any way I can accomplish this in RapidMiner? I've tried using Cut Document and Documents to Data, as well as the Rosette Text Analytics and Information-Extraction extensions, but I'm quite lost.

Thanks!
