"Intelligent Text Extraction"
Hi everyone,
this may be a very basic question. I am trying to extract text from various locally stored HTML files. The structure of the part that I want to extract is similar in each document but not 100% identical. Is it possible to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND some "keywords" that must appear between the start and end text, so that RapidMiner knows it is extracting the correct text? The problem I am running into with "Cut Document" and its Regular Region Parameter is that my start text CAN occur several times before the part that I actually want.
Example:
<td style=" width:52.50%; text-align:left; " class="ta_10"><span class="ta_11">This is an example Text </span></td>
.....
<td style=" width:100.00%; text-align:left; " class="ta_30"><span class="ta_31">This is an example Text</span></td>
...
<td style=" width:100.00%; text-align:left; " class="ta_10"><span class="ta_11">Keyword </span></td>
...
<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>
So what I need is the second "This is an example Text" as the starting point and all the HTML text down to "UniqueEndTag". If I use "Cut Document", the problem is that I cannot write a regex that distinguishes between the first and second occurrence of my start text, because the beginning of each HTML string can be completely different. I do have some unique words that could identify the region I want to extract (in my example, "Keyword"). I have been playing with the Information Extraction Plugin, since I could do some annotation there, but I couldn't figure out how to make it work for my purpose.
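For illustration only, the start/keyword/end idea described above can be sketched outside RapidMiner in plain Python. This is a minimal sketch, not a RapidMiner operator: the function name `extract_region` is hypothetical, and it assumes that the first occurrence of the keyword in the file is the one that marks the wanted region. It locates the keyword first, then takes the last start text before it and the first end text after it.

```python
def extract_region(html, start_text, keyword, end_text):
    """Return the text from the last start_text before the keyword
    up to and including the first end_text after the keyword.

    Returns None if any of the three markers cannot be found.
    Assumption: the first occurrence of `keyword` is the relevant one.
    """
    keyword_pos = html.find(keyword)
    if keyword_pos == -1:
        return None

    # Search backwards from the keyword, so earlier repeats of the
    # start text are skipped and only the closest one is used.
    start_pos = html.rfind(start_text, 0, keyword_pos)
    if start_pos == -1:
        return None

    # The end text must come after the keyword.
    end_pos = html.find(end_text, keyword_pos + len(keyword))
    if end_pos == -1:
        return None

    return html[start_pos:end_pos + len(end_text)]
```

Because it only uses plain substring searches, each file is scanned a small, fixed number of times, with no regex backtracking involved.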
Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome!
Answers
-
Hi,
this seems tricky. My approach would either be a (tricky) regex or something like HTML to XML and then Process XSLT?
~Martin
0 -
Hi Martin,
I know, normally a regex would be the best solution if I had some structure that let me distinguish between my different start texts. However, I don't know whether a very complex regex with multiline lookahead and lookbehind will run into performance issues, as I have a lot of documents....
For XSLT, I doubt that it would work, as my text has no unique tags but is randomly formatted with inline <span> classes, which do not necessarily have similar attributes...
To get back to my original question: are you aware of any operator within the IE plugin that could address this problem? Or is this really something that I will have to do with "Cut Documents" and the Regular Region Parameter?
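On the regex route mentioned above, one possible shape for such a pattern is sketched here in Python. It is an illustration under assumptions, not a tested RapidMiner setting: the helper names `build_pattern` and `extract_with_regex` are hypothetical, and the three marker strings are taken from the example in the question. The pattern uses a single negative lookahead so that the match cannot run across another occurrence of the start text, which forces it to begin at the last start text before the keyword. Its performance on large files has not been measured here.

```python
import re


def build_pattern(start_text, keyword, end_text):
    """Compile a pattern: start, then anything that does not contain
    another start, then the keyword, then anything up to the end text."""
    s, k, e = (re.escape(t) for t in (start_text, keyword, end_text))
    # (?:(?!START).)*? consumes one character at a time, but only while
    # the start text does not begin again at the current position.
    return re.compile(rf"{s}(?:(?!{s}).)*?{k}.*?{e}", re.DOTALL)


# Compile once and reuse the pattern for every document.
PATTERN = build_pattern("This is an example Text", "Keyword", "UniqueEndTag")


def extract_with_regex(html):
    """Return the matched region, or None if the pattern does not match."""
    match = PATTERN.search(html)
    return match.group(0) if match else None
```

Negative lookahead and lazy quantifiers are also part of Java's regex syntax, so a pattern of this shape may be usable in a regular region query as well, although that has not been verified here.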
0 -
Or is there any possibility that I could extract one text as a reference and "train" RapidMiner to detect this part in all other files based on its high similarity?
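The reference-text idea above can be approximated without any training by a plain similarity search. The sketch below uses Python's standard `difflib` module and is only an illustration: the function name `best_matching_window` and the choice to compare fixed-size windows of lines are assumptions, not something RapidMiner provides. It slides a window over the lines of a document and reports the window that is most similar to the reference text.

```python
from difflib import SequenceMatcher


def best_matching_window(reference, lines, window):
    """Return (score, start_index) of the `window`-line slice of `lines`
    that is most similar to `reference`.

    The score is SequenceMatcher's ratio between 0.0 and 1.0.
    start_index is -1 if no window scored above 0.0.
    """
    best_score, best_index = 0.0, -1
    last_start = max(1, len(lines) - window + 1)
    for i in range(last_start):
        candidate = "\n".join(lines[i:i + window])
        score = SequenceMatcher(None, reference, candidate).ratio()
        if score > best_score:
            best_score, best_index = score, i
    return best_score, best_index
```

A call such as `best_matching_window(reference_text, html.splitlines(), 3)` would return the best score and the line index where that window starts. Comparing every window against the reference can be slow on long files, so this is probably better suited to narrowing down candidates than to processing a large collection directly.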
0