How to extract a piece of text occurring before a certain type of formatting in html?

User: "MRNJEM001"
New Altair Community Member
Updated by Jocelyn

Hi community :)

 

Beginner here, I've tried my best to figure it out but unfortunately haven't cracked the case.

 

I have a piece of software that outputs text as HTML with certain words in red. I need a document containing the word immediately preceding each red word.

 

For example, I need to get this word and also that word as output.

 

Here is an example extract of the software output:

 

<span class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSansBold",serif;color:#E74E31'>challenges</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>in</span></span>

 

As you can see, it is quite messy (a random &nbsp; in between each word).

In this extract, "challenges" is the flagged (red) word and I need to output "substantial". A document contains a few hundred red words.

 

Is there any way I can accomplish this in RapidMiner? I've tried Cut Document and Documents to Data, as well as the Rosette Text Analytics and Information-Extraction extensions, but I'm quite lost.

Thanks!

    User: "JEdward"
    New Altair Community Member
    Accepted Answer

    I took a very similar approach to Kayman for this. I first replaced each &nbsp; with a space.

    Then I replaced every opening span of the span-red class with the marker REDWORDWOO.

    Next I removed all other HTML tags from the document. After tokenizing and generating 2-grams, I can select only the tokens that end in "_REDWORDWOO", which are the words before a red word.
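For readers who want to check the logic outside RapidMiner, here is a minimal Python sketch of the same three replacements followed by a pair-wise scan of the tokens. It is not part of the process in this answer. The function name `words_before_red` and the shortened sample string are made up for illustration, and the letters-only tokenizer is an assumption about how Tokenize behaves in its default mode.

```python
import re
from typing import List

MARKER = "REDWORDWOO"  # same placeholder the process uses

NBSP = re.compile(r"&nbsp;")
RED_SPAN = re.compile(r"<span\s+class=span-red>")  # opening tag of a red word
ANY_TAG = re.compile(r"<[\s\S]*?>")                # any remaining HTML tag


def words_before_red(html: str) -> List[str]:
    """Return the word immediately preceding each red word (hypothetical helper)."""
    # The order matters: the red span must become the marker
    # before the generic tag pattern strips every tag.
    text = NBSP.sub(" ", html)
    text = RED_SPAN.sub(MARKER, text)
    text = ANY_TAG.sub(" ", text)
    # Assumption: tokens are runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Equivalent to keeping the 2-grams that end in _REDWORDWOO
    # and then dropping the marker part.
    return [prev for prev, cur in zip(tokens, tokens[1:]) if cur == MARKER]


# Shortened version of the extract from the question (style attributes trimmed).
sample = (
    "<span class=span-black><span lang=EN-ZA style='color:#4E4E4A'>substantial</span></span>"
    "<span\nclass=apple-converted-space><span lang=EN-ZA style='color:#4E4E4A'>&nbsp;</span></span>"
    "<span\nclass=span-red><span lang=EN-ZA style='color:#E74E31'>challenges</span></span>"
    "<span\nclass=apple-converted-space><span lang=EN-ZA style='color:#4E4E4A'>&nbsp;</span></span>"
    "<span\nclass=span-black><span lang=EN-ZA style='color:#4E4E4A'>in</span></span>"
)

print(words_before_red(sample))  # ['substantial']
```

If the red-span replacement ran after the generic tag removal, the marker would never be inserted, so the order of the three patterns is the part worth double-checking when adapting this.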

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="85">
    <parameter key="text" value="&lt;span&#10;class=span-black&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;a&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=apple-converted-space&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;&#10;line-height:107%;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=span-black&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;reasonably&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=apple-converted-space&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;&#10;line-height:107%;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=span-red&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSansBold&quot;,serif;color:#E74E31'&gt;good&lt;/span&gt;&lt;/span&gt;"/>
    </operator>
    <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="179" y="85">
    <list key="replace_dictionary">
    <parameter key="&amp;nbsp;" value=" "/>
    <parameter key="\&lt;span\sclass=span-red\&gt;" value="REDWORDWOO"/>
    <parameter key="\&lt;[\S\s]*?\&gt;" value=" "/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">1. Replace nbsp&lt;br/&gt;2. Change all occurrences of Red colour into REDWORDWOO&lt;br/&gt;3. Remove all other HTML content.</description>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="85"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="85">
    <description align="center" color="transparent" colored="false" width="126">Generate 2-grams</description>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="581" y="85">
    <parameter key="string" value="_REDWORDWOO"/>
    <description align="center" color="transparent" colored="false" width="126">Remove any tokens that don't end in _REDWORDWOO</description>
    </operator>
    <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="715" y="85">
    <list key="replace_dictionary">
    <parameter key="_REDWORDWOO" value=" "/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Remove _REDWORDWOO</description>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
    <connect from_op="Replace Tokens" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
    <connect from_op="Replace Tokens (2)" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>