"Intelligent Text Extraction"

New Altair Community Member

May 30, 2016

Updated Nov 5, 2024 by Jocelyn

Hi everyone,

This is probably a very basic question. I am trying to extract text from various locally stored HTML files. The part of the text that I want to extract has a similar structure in each document, but it is not 100% identical. Is it possible to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND some "keywords" that must appear between the start and end text, so that RapidMiner knows it is extracting the correct region? The problem I am running into with "Cut Document" and its "regular region" parameter is that the start text CAN occur a few times before the part that I actually want.

Example:

<td style=" width:52.50%; text-align:left; " class="ta_10">This is an example Text </td>

.....

<td style=" width:100.00%; text-align:left; " class="ta_30">This is an example Text</td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10">Keyword </td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>

So what I need is the second "This is an example Text" as the starting point and all the HTML down to "UniqueEndTag". With "Cut Document" I cannot write a regex that distinguishes between the first and second occurrence of my start text, because the beginning of each HTML string can be completely different. I do have some unique words that could identify the region I want to extract (in my example, "Keyword"). I also experimented with the Information Extraction plugin, since it supports annotation, but I couldn't figure out how to apply it to my purpose.

Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome! :)
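For illustration, the constraint described above (start text, then the keyword with no further start text in between, then the end text) can be written as a single regular expression using a negative lookahead. The following is only a minimal sketch in Python, outside RapidMiner; the function name `extract_region` and the simplified sample HTML are assumptions made for this example. Whether the same pattern can be pasted into the "Cut Document" regular region setting has not been verified here.

```python
import re


def extract_region(html, start, keyword, end):
    """Return the text from the last `start` before `keyword` through `end`.

    Hypothetical helper for illustration; returns None if no region matches.
    """
    pattern = re.compile(
        re.escape(start)
        # Consume characters one by one, but never step onto another
        # occurrence of the start text before the keyword is reached.
        + r"(?:(?!" + re.escape(start) + r").)*?"
        + re.escape(keyword)
        # Then take everything up to the first end text after the keyword.
        + r".*?"
        + re.escape(end),
        re.DOTALL,  # let "." match newlines, since the region spans lines
    )
    match = pattern.search(html)
    return match.group(0) if match else None


# Simplified stand-in for the HTML in the example above (an assumption).
sample = (
    '<td class="ta_10">This is an example Text </td>\n'
    '<td class="ta_30">This is an example Text</td>\n'
    '<td class="ta_10">Keyword </td>\n'
    '<td class="ta_10">UniqueEndTag</td>\n'
)

region = extract_region(sample, "This is an example Text", "Keyword", "UniqueEndTag")
print(region)
```

Because the lookahead forbids a second start text between the start and the keyword, a match attempt that begins at the first occurrence fails and the search moves on to the occurrence closest to the keyword. Note that the end text is matched at its first appearance after the keyword, so if it also occurs inside an attribute value (as in the `name="ns:UniqueEndTag"` attribute above), a more specific end string would be needed.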
