
"Intelligent Text Extraction"

Hi everyone,

this is probably a very basic question. I am trying to extract text from various locally stored HTML files. The structure of the part that I want to extract is similar in each document but not 100% identical. Is there a way to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND define some "keywords" that must appear between the start and end text, so that RapidMiner knows it is extracting the correct text? The problem I am running into with "Cut Document" and its Regular Region parameter is that my start text CAN occur several times before the part that I actually want.

Example:

<td style=" width:52.50%; text-align:left; " class="ta_10">This is an example Text </td>

.....

<td style=" width:100.00%; text-align:left; " class="ta_30">This is an example Text</td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10">Keyword </td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>

So what I need is the second "This is an example Text" as the starting point and all the HTML text down to "UniqueEndTag". If I use "Cut Document", I cannot write a regex that distinguishes between the first and second occurrence of my start text, because the beginning of each HTML string can be completely different. I do have some unique words that could identify the region that I want to extract (in my example, "Keyword"). I was playing with the Information Extraction Plugin, since I could do some annotation there, but I couldn't figure out how it would work for my purpose.

Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome!


MartinLiebig

Hi,

this seems tricky. My approach would be either a (tricky) regex or something like HTML to XML followed by Process XSLT.

~Martin
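To make the "(tricky) regex" idea concrete, here is a minimal sketch in plain Python rather than inside RapidMiner. It assumes the start text, keyword, and end text from the example in the question; the sample `html` string is a simplified stand-in for a real file. The trick is a negative lookahead that forbids the start text from appearing again before the keyword, so the match can only begin at the last start text in front of the keyword. Java's regex engine supports the same lookahead syntax, but whether the pattern behaves identically in the Regular Region parameter is untested here.

```python
import re

# Marker strings taken from the example in the question.
START = "This is an example Text"
KEYWORD = "Keyword"
END = "UniqueEndTag"

# After START, consume characters only while START does not begin again.
# A match starting at an earlier START fails, because it would have to
# cross the later START before reaching KEYWORD.
no_start = rf"(?:(?!{re.escape(START)}).)*?"
pattern = re.compile(
    re.escape(START) + no_start + re.escape(KEYWORD) + r".*?" + re.escape(END),
    re.DOTALL,  # let "." span line breaks in the HTML
)

# Simplified stand-in for one HTML document (an assumption for illustration).
html = (
    '<td class="ta_10">This is an example Text </td>\n<p>unrelated</p>\n'
    '<td class="ta_30">This is an example Text</td>\n'
    '<td class="ta_10">Keyword </td>\n'
    '<td class="ta_10">UniqueEndTag</td>\n'
)

match = pattern.search(html)
region = match.group(0) if match else None
print(region)
```

In the real files the end text also appears inside an attribute (`name="ns:UniqueEndTag"`), so the end marker would probably need to be more specific than in this sketch.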

limegreenman900

Hi Martin,

I know, normally a regex would be the best solution if I had some structure that let me distinguish between my different start texts. However, I don't know whether a very complex regex with multiline lookahead and lookbehind will run into performance issues, as I have a lot of documents...

As for XSLT, I doubt that it would work, because my text has no unique tags but is randomly formatted with inline classes that do not necessarily share similar attributes...

To get back to my original question: are you aware of any operator within the IE plugin that could address this problem? Or is this really something that I will have to do with "Cut Document" and the Regular Region parameter?
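On the performance worry: if a lookaround-heavy regex turns out to be too slow, the same selection can be done with plain substring searches, which scan each document roughly linearly. The following is only a sketch outside RapidMiner; `extract_region` is a hypothetical helper name, and it assumes that the keyword reliably sits between the wanted start text and the end text.

```python
def extract_region(html, start, keyword, end):
    """Return the text from the last `start` before `keyword` through `end`.

    Returns None if no keyword occurrence has a start before it and an
    end after it. (Hypothetical helper, not a RapidMiner operator.)
    """
    kw = html.find(keyword)
    while kw != -1:
        # Search backwards from the keyword for the nearest start text.
        begin = html.rfind(start, 0, kw)
        # Search forwards from the keyword for the end text.
        finish = html.find(end, kw + len(keyword))
        if begin != -1 and finish != -1:
            return html[begin:finish + len(end)]
        # This keyword occurrence did not qualify; try the next one.
        kw = html.find(keyword, kw + 1)
    return None


# Simplified stand-in for one HTML document (an assumption for illustration).
sample = (
    '<td class="ta_10">This is an example Text </td>\n<p>unrelated</p>\n'
    '<td class="ta_30">This is an example Text</td>\n'
    '<td class="ta_10">Keyword </td>\n'
    '<td class="ta_10">UniqueEndTag</td>\n'
)
found = extract_region(sample, "This is an example Text", "Keyword", "UniqueEndTag")
print(found)
```

Anchoring on the keyword first and then looking backwards for the start text avoids having to tell the first and second occurrence apart at all.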

limegreenman900

Or is there a way to extract one text as a reference and "train" RapidMiner to detect the corresponding part in all other files based on its high similarity to that reference?
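The reference idea can at least be approximated without any training: cut every candidate region (each start text up to the next end text) and keep the one most similar to a hand-picked reference. Below is a rough sketch using Python's standard `difflib`; `best_candidate`, the sample document, and the reference string are all assumptions for illustration, and character-level similarity may be too slow or too crude for large files.

```python
from difflib import SequenceMatcher


def best_candidate(html, start, end, reference):
    """Return (region, score) for the start..end span most similar to `reference`.

    Hypothetical helper: tries every occurrence of `start`, cuts up to the
    next `end`, and scores the cut against the reference text.
    """
    best, best_score = None, 0.0
    pos = html.find(start)
    while pos != -1:
        stop = html.find(end, pos + len(start))
        if stop == -1:
            break  # no end text after this point, so no further candidates
        candidate = html[pos:stop + len(end)]
        score = SequenceMatcher(None, reference, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
        pos = html.find(start, pos + 1)
    return best, best_score


# Reference region taken by hand from one document (illustrative only).
reference = "This is an example Text</td><td>Keyword A</td><td>UniqueEndTag"
# Another document in which the start text occurs twice.
doc = (
    "<td>This is an example Text </td><p>unrelated filler text</p>"
    "<td>This is an example Text</td><td>Keyword B</td><td>UniqueEndTag</td>"
)
best, score = best_candidate(doc, "This is an example Text", "UniqueEndTag", reference)
print(best, round(score, 2))
```

The candidate that begins at the first start text drags the unrelated filler along, so it scores lower than the one beginning at the second start text.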