How to extract a piece of text occurring before a certain type of formatting in html?

MRNJEM001
MRNJEM001 New Altair Community Member
edited November 5 in Community Q&A

Hi community :)

 

Beginner here, I've tried my best to figure it out but unfortunately haven't cracked the case.

 

I have a piece of software that outputs text as HTML with certain words in red. I need to produce a document containing the word that immediately precedes each red word.

 

For example, I need to get this word and also that word as output.

 

Here is an example extract of the software output:

 

<span class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>substantial</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSansBold",serif;color:#E74E31'>challenges</span></span><span
class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
font-family:"FreeSans",serif;color:#4E4E4A'>in</span></span>

 

As you can see, it is quite messy (a random &nbsp; in between each word).

In this extract, challenges is the flagged word and I need to output substantial. In a document there are a few hundred red words.

 

Is there any way I can accomplish this in RapidMiner? I've tried using Cut Document and Documents to Data, as well as the Rosette Text Analytics and Information-Extraction extensions, but I'm quite lost.

Thanks!

Best Answer

  • JEdward
    JEdward New Altair Community Member
    Answer ✓

    I took a very similar approach to Kayman for this.  I first replaced the &nbsp; with a space.

    Then I replaced all occurrences of the red-word class span with REDWORDWOO.

    Next I removed all other HTML tags from the document.  After tokenizing and generating 2-grams, I can then select only the tokens that end in "_REDWORDWOO", which are the words before a red word.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="85">
    <parameter key="text" value="&lt;span&#10;class=span-black&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;a&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=apple-converted-space&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;&#10;line-height:107%;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=span-black&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;reasonably&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=apple-converted-space&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;&#10;line-height:107%;font-family:&quot;FreeSans&quot;,serif;color:#4E4E4A'&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span&#10;class=span-red&gt;&lt;span lang=EN-ZA style='font-size:10.0pt;line-height:107%;&#10;font-family:&quot;FreeSansBold&quot;,serif;color:#E74E31'&gt;good&lt;/span&gt;&lt;/span&gt;"/>
    </operator>
    <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens" width="90" x="179" y="85">
    <list key="replace_dictionary">
    <parameter key="&amp;nbsp;" value=" "/>
    <parameter key="\&lt;span\sclass=span-red\&gt;" value="REDWORDWOO"/>
    <parameter key="\&lt;[\S\s]*?\&gt;" value=" "/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">1. Replace nbsp&lt;br/&gt;2. Change all occurrences of Red colour into REDWORDWOO&lt;br/&gt;3. Remove all other HTML content.</description>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="85"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.1.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="447" y="85">
    <description align="center" color="transparent" colored="false" width="126">Generate 2-grams</description>
    </operator>
    <operator activated="true" class="text:filter_tokens_by_content" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="581" y="85">
    <parameter key="string" value="_REDWORDWOO"/>
    <description align="center" color="transparent" colored="false" width="126">Remove any tokens that don't end in _REDWORDWOO</description>
    </operator>
    <operator activated="true" class="text:replace_tokens" compatibility="8.1.000" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="715" y="85">
    <list key="replace_dictionary">
    <parameter key="_REDWORDWOO" value=" "/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Remove _REDWORDWOO</description>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
    <connect from_op="Replace Tokens" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
    <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
    <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
    <connect from_op="Replace Tokens (2)" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
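
For readers who want to try the idea outside RapidMiner, here is a rough Python sketch of the same steps (replace &nbsp;, mark red spans, strip tags, tokenize, look at 2-grams). It is not an export of the process above. The helper `span`, the name `words_before_red`, and the trimmed-down sample markup are assumptions made for illustration, and the letters-only tokenizer only approximates what the Tokenize operator does.

```python
import re

MARKER = "REDWORDWOO"  # placeholder name taken from the process above


def span(cls, colour, text):
    """Hypothetical helper: one word in a trimmed-down form of the poster's markup."""
    return f"<span\nclass={cls}><span lang=EN-ZA style='color:{colour}'>{text}</span></span>"


BLACK, RED = "#4E4E4A", "#E74E31"
SPACE = span("apple-converted-space", BLACK, "&nbsp;")
# Reads "a reasonably good substantial challenges in"; "good" and "challenges" are red.
SAMPLE = SPACE.join([
    span("span-black", BLACK, "a"),
    span("span-black", BLACK, "reasonably"),
    span("span-red", RED, "good"),
    span("span-black", BLACK, "substantial"),
    span("span-red", RED, "challenges"),
    span("span-black", BLACK, "in"),
])


def words_before_red(html):
    # 1. Turn non-breaking spaces into ordinary spaces.
    text = html.replace("&nbsp;", " ")
    # 2. Replace each opening red-class span with the marker.
    text = re.sub(r"<span\s+class=span-red>", MARKER, text)
    # 3. Replace every remaining tag with a space.
    text = re.sub(r"<[\s\S]*?>", " ", text)
    # 4. Tokenize into runs of letters (assumption: close to Tokenize's default).
    tokens = re.findall(r"[^\W\d_]+", text)
    # 5. Walk the 2-grams and keep the first word whenever the second is the marker.
    return [first for first, second in zip(tokens, tokens[1:]) if second == MARKER]


print(words_before_red(SAMPLE))
```

As in the process, two red words in a row would return the first red word as well, so real data may need an extra filter.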
     

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    Is the exact html formatting around the words of interest always the same?  If so, you should be able to extract them pretty consistently using the Cut Document & Extract Information operators with an appropriately crafted regular expression or regular region selection.  Of course, if there are other words that have the same formatting, then you might end up with some "false positive" matches as well, but at least you would have the terms of interest.

     

  • MRNJEM001
    MRNJEM001 New Altair Community Member

    The trouble is there are thousands of words formatted in black. All the exact same html. I need to extract only the words just before the red formatted words. Would the method you describe be possible for that?

     

    The example below reads "a reasonably good" and I need to extract only "reasonably", not "a" (or any of the words before that).

     

    <span
    class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
    font-family:"FreeSans",serif;color:#4E4E4A'>a</span></span><span
    class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
    line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
    class=span-black><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
    font-family:"FreeSans",serif;color:#4E4E4A'>reasonably</span></span><span
    class=apple-converted-space><span lang=EN-ZA style='font-size:10.0pt;
    line-height:107%;font-family:"FreeSans",serif;color:#4E4E4A'>&nbsp;</span></span><span
    class=span-red><span lang=EN-ZA style='font-size:10.0pt;line-height:107%;
    font-family:"FreeSansBold",serif;color:#E74E31'>good</span></span>
  • Telcontar120
    Telcontar120 New Altair Community Member

    Ah, I see, so my "simple" solution of taking everything between the "black" and "red" codes would create way too many false positives to be helpful.  

    It still seems like a properly crafted regex should be able to get this using a backwards lookup, but it's beyond my limited skills with all the html formatting in the way.  Maybe one of our other regex experts could weigh in?   @Edin_Klapic @BalazsBarany @kayman @JEdward @Thomas_Ott
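
One way such a single regex could look, sketched in Python rather than in a RapidMiner operator: it uses a lookahead from the black word instead of a lookbehind from the red one, so the black word is captured only when nothing but spacer spans sits between it and a red span. The names `span`, `find_words_before_red` and the trimmed-down sample markup are assumptions for illustration, and the pattern assumes each word sits in exactly one outer and one inner span as in the snippet above.

```python
import re


def span(cls, colour, text):
    """Hypothetical helper: one word in a trimmed-down form of the poster's markup."""
    return f"<span\nclass={cls}><span lang=EN-ZA style='color:{colour}'>{text}</span></span>"


BLACK, RED = "#4E4E4A", "#E74E31"
SPACE = span("apple-converted-space", BLACK, "&nbsp;")
# Reads "a reasonably good substantial challenges in"; "good" and "challenges" are red.
SAMPLE = SPACE.join([
    span("span-black", BLACK, "a"),
    span("span-black", BLACK, "reasonably"),
    span("span-red", RED, "good"),
    span("span-black", BLACK, "substantial"),
    span("span-red", RED, "challenges"),
    span("span-black", BLACK, "in"),
])

# A black word: outer class span, inner style span, captured text.
BLACK_WORD = r"<span\s+class=span-black>\s*<span[^>]*>([^<]*)</span>\s*</span>"
# A spacer: the apple-converted-space span holding the &nbsp;.
SPACER = r"<span\s+class=apple-converted-space>\s*<span[^>]*>[^<]*</span>\s*</span>"
# The opening tag of a red word.
RED_OPEN = r"<span\s+class=span-red>"

# Black word, then (without consuming it) any number of spacers and a red span.
PATTERN = re.compile(BLACK_WORD + r"(?=(?:\s*" + SPACER + r")*\s*" + RED_OPEN + r")")


def find_words_before_red(html):
    # findall returns the single capture group, i.e. the black word's text.
    return PATTERN.findall(html)


print(find_words_before_red(SAMPLE))
```

Because the lookahead consumes nothing, every black word is tested independently, and black words followed by another black word are simply skipped.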

  • kayman
    kayman New Altair Community Member

    Hmm, nice challenge...

     

    Not sure if it is the best approach, but one option is to use the fact that red words are always wrapped in a span that carries the style. So a multistep approach might work.

     

    Some assumptions I make based on your code snippet:

    Black words are in a so-called span-black span, red words are in span-red spans, and the other spans can be removed because they serve no real purpose.

     

    First capture the red tags as follows

    (?ms)<span\sclass=span-red><span.*?E74E31'>(.*?)<\/span><\/span>

    and replace with redword:$1

    Next, capture the black tags as follows:

    (?ms)<span\sclass=span-black><span.*?4E4E4A'>(.*?)<\/span><\/span>

     and replace for instance with blackword:$1

    and finally remove all of the other span tags as follows

    (?ms)<\/?span.*?>

    And replace with nothing .

    This would give you something like 

     

    blackword:a&nbsp;blackword:reasonably&nbsp;redword:good

    So, now we still have these dreadful non-breaking spaces, which we just replace with a space so we get

     

    blackword:a blackword:reasonably redword:good

    And now we get the blackwords followed by a redword as follows :

     

    (?ms)blackword:(\w+)\s*redword:\w+

    and replace with $1, leaving you with 

    blackword:a reasonably

    So now the only thing left is to get rid of every word starting with blackword (since these were not followed by a redword), and you have your 'first black before a red one' word:

     

    (?ms)blackword:\w+\s*(\w+)

    Replace once again with $1, and what you are left with is reasonably.

     

    Tried with multiple copies of your snippet and works pretty fine, but not guaranteed to work on actual data so you may need to tweak a bit
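
The chain above can be tried quickly in Python. This is only a sketch under a few assumptions: `$1` becomes `\1` in Python's replacement syntax, the helper `span`, the function name `black_before_red` and the trimmed-down sample markup are made up for illustration, and the last two replacements are swapped for a single `findall` that collects each blackword sitting directly before a redword (a simplification, not the steps as posted).

```python
import re


def span(cls, colour, text):
    """Hypothetical helper: one word in a trimmed-down form of the poster's markup."""
    return f"<span\nclass={cls}><span lang=EN-ZA style='color:{colour}'>{text}</span></span>"


BLACK, RED = "#4E4E4A", "#E74E31"
SPACE = span("apple-converted-space", BLACK, "&nbsp;")
# Reads "a reasonably good substantial challenges in"; "good" and "challenges" are red.
SAMPLE = SPACE.join([
    span("span-black", BLACK, "a"),
    span("span-black", BLACK, "reasonably"),
    span("span-red", RED, "good"),
    span("span-black", BLACK, "substantial"),
    span("span-red", RED, "challenges"),
    span("span-black", BLACK, "in"),
])


def black_before_red(html):
    # Label the red words.
    text = re.sub(r"(?s)<span\sclass=span-red><span.*?E74E31'>(.*?)</span></span>",
                  r"redword:\1", html)
    # Label the black words.
    text = re.sub(r"(?s)<span\sclass=span-black><span.*?4E4E4A'>(.*?)</span></span>",
                  r"blackword:\1", text)
    # Remove all remaining span tags (the spacer spans), leaving only &nbsp;.
    text = re.sub(r"(?s)</?span.*?>", "", text)
    # Replace the non-breaking spaces with ordinary spaces.
    text = text.replace("&nbsp;", " ")
    # Swapped-in final step: collect every blackword directly followed by a redword.
    return re.findall(r"blackword:(\w+)\s*redword:", text)


print(black_before_red(SAMPLE))
```

Collecting the matches instead of deleting the leftovers avoids having to clean up runs of several blackwords in a row, which is where tweaking on real data is most likely to be needed.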

  • Telcontar120
    Telcontar120 New Altair Community Member

    I am always impressed by the creativity of the RapidMiner community!!  Thanks @JEdward and @kayman for these innovative solutions to a tricky problem.  

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Thanks for the tag but I'm too late to this challenge. What @Telcontar120 said!

  • MRNJEM001
    MRNJEM001 New Altair Community Member

    Thank you for the amazing solutions @JEdward and @kayman, you are incredible. It's for a thesis, so if it ever gets published you should be in the acknowledgements :)

     

    Thanks again!