Problem separating HTML structure from the content of web pages

Question

Hi everyone,

I have a problem. I'm trying to separate the HTML structure from the content of web pages. I guess this is possible with the Extract Content operator in RapidMiner 5. I've tried many different ways to use this operator, but there is always an error: "com.rapidminer.operator.IOObjectCollection cannot be cast to com.rapidminer.operator.text.Document".

I've tried: Read Excel -> Get Pages -> Extract Content -> Data to Document -> Process Document -> Data to Similarity -> Tokenize...

I've also tried the Extract Content operator together with the Web Crawler, and I've tried changing the order of the operators, but there is always the same error.

Am I doing something wrong? And what should I do? In general, is this the right way to separate the structure from the content, or is there a different approach?
I'm a student and very much a beginner, and I have to do some kind of text mining for my thesis. If anybody has some ideas, I'll be very grateful.

Thanks in advance!

Kind regards,
Iliqna

colo · Answer

Hi Iliqna,

since I consider this the better place for your question, I will answer here instead of replying to the similar post in the "Chit Chat" Forum (http://rapid-i.com/rapidforum/index.php/topic,3660.0.html).

iletoooo wrote: I've tried: Read Excel -> Get Pages -> Extract Content -> Data to Document -> Process Document -> Data to Similarity -> Tokenize...

In general, your idea to extract only the content (and get rid of the HTML tags) with the "Extract Content" operator isn't wrong. If you replace "Get Pages" with "Get Page", it will work. This is because "Extract Content" works on one single document, but "Get Pages" and "Crawl Web" generate example sets as output.

If you want to use one of these two operators I would suggest the following chain:

Read Excel -> Get Pages -> Process Documents From Data

You can then place "Extract Content" inside the "Process Documents from Data" operator (the inner chain will process each example as a single document). This is also the proper place for tokenization or similar text processing.
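
If it helps to see the idea outside of RapidMiner, here is a rough Python analogy. It is only a sketch and not how the operators are implemented: the function names `extract_content` and `process_documents_from_data`, the `pages` table, and the example.com URLs are all made up for illustration. It shows a step that accepts one document at a time, and an outer step that applies it to each row of a table.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_content(html):
    """Hypothetical stand-in for "Extract Content": takes ONE document (a string)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def process_documents_from_data(rows, column="html"):
    """Hypothetical stand-in for the outer operator: runs the inner step per row."""
    return [extract_content(row[column]) for row in rows]


# Made-up table standing in for the output of "Get Pages": one row per page.
pages = [
    {"url": "http://example.com/a",
     "html": "<html><body><h1>Title</h1><p>First page.</p></body></html>"},
    {"url": "http://example.com/b",
     "html": "<p>Second <b>page</b></p><script>var x = 1;</script>"},
]

texts = process_documents_from_data(pages)
print(texts)
```

Passing the whole `pages` table directly to `extract_content` fails with a type error, which is roughly the same kind of mismatch as the cast error in the original chain: the step expects a single document but receives a collection.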

Regards
Matthias