Get Page operator stalls RapidMiner (SOLVED)

Question

When I run the process below (with the Web Mining and Text Mining extensions loaded), RapidMiner stalls when trying to display the results. It eventually shows the results, but something seems to keep running in the background, which makes RapidMiner very sluggish. I've been using this for years. I also tried version 10 and I'm experiencing the same issue. Note: I wasn't allowed to post the links that were in the XML code. To replicate, just add 2 random links to the Get Page operator. Any ideas?

MarkusH23 · Answer

Thanks @ceaperez

It makes a small difference but it still takes minutes to display the results from two web pages.

The issue was with the document vector creation: not producing a document vector resolved it. If you need a document vector of the HTML content, adding a tokenizer also eliminates the long wait time and unresponsiveness. In RapidMiner, when no tokenizer is used, the entire document is a single token, and RapidMiner seems to struggle to render this.
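
To illustrate the point outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation; the function name document_vector, the sample HTML, and the simple letters-only tokenizer are all assumptions made for the example. It only shows why an untokenized page ends up as one enormous token, while a tokenized page yields a handful of short ones.

```python
import re
from collections import Counter


def document_vector(text, tokenize=True):
    """Toy stand-in for a document vector: maps each token to its count.

    Hypothetical helper for illustration only, not a RapidMiner API.
    """
    if tokenize:
        # Split on anything that is not a letter (a simple non-letter tokenizer).
        tokens = re.findall(r"[A-Za-z]+", text)
    else:
        # No tokenizer: the whole document becomes a single token.
        tokens = [text]
    return Counter(tokens)


# Made-up page content, repeated to mimic a large HTML document.
html = "<html><body><p>RapidMiner stalls when rendering</p></body></html>" * 1000

untokenized = document_vector(html, tokenize=False)
tokenized = document_vector(html, tokenize=True)

# Number of distinct tokens and length of the longest one, for each variant.
print(len(untokenized), max(len(t) for t in untokenized))
print(len(tokenized), max(len(t) for t in tokenized))
```

In the untokenized case the single "attribute name" is the full page text, which is presumably what the results view has trouble rendering.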

Thanks again

Markus

Caperez · Answer

Hi @MarkusH23, 
I tested your process with another regular expression because the URL is not included in your data.
Just by changing the compatibility level in the Extract Information operator, it runs faster and is more stable.

Please try it.

Best,

Cesar