Web Mining - Web Page Similarity

Question

Hello,

I am new to RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.

I want to find similarities between some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. I then use the "Data to Documents" and "Process Documents" operators. I then tokenize the web pages, filter stopwords, and transform cases. Finally, I use the "Data to Similarity" operator.
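For anyone who finds code easier to follow, the text steps above (tokenize, filter stopwords, transform cases, compute similarity) correspond roughly to the plain-Python sketch below. It is not RapidMiner code. The stop word list, the function names, and the choice of cosine similarity over raw term counts are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list, far shorter than a real one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, split it into runs of letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_1 = "The cat sat in the garden"
    page_2 = "A cat and a dog sat in the garden"
    print(cosine_similarity(tokenize(page_1), tokenize(page_2)))
```

If the input text still contains markup, tag and attribute names come through as ordinary tokens and inflate the similarity between otherwise unrelated pages.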

However, I notice that my results contain a lot of HTML tokens, which I do not want. I know that the "Extract Content" operator can strip away the HTML, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means that I am unable to strip the HTML if I want to fetch multiple pages at once using the "Get Pages" operator.

Could somebody advise on how to do this? I will be really thankful!

Have a good day!

- Prat

MariusHelf · Answer

Hi, Extract Content expects a Document object on its input, which is what Get Page delivers. Get Pages, however, delivers an example set, which you pass into the Process Documents from Data operator. Inside that operator you have Document objects again, so you should be able to apply Extract Content there.
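As a rough analogy in plain Python (not RapidMiner code), the idea is to strip the markup from each document individually while looping over the rows of the table, rather than trying to strip the table as a whole. The names `extract_content`, `process_documents`, and the `"html"` column key below are hypothetical, and this sketch only removes tags and script/style contents, so it is a simplification of what the real operator may do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_content(html):
    """Per-document step: turn one page's HTML into plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def process_documents(rows):
    """Apply the per-document step to every row of the fetched table."""
    return [extract_content(row["html"]) for row in rows]


if __name__ == "__main__":
    rows = [{"html": "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"}]
    print(process_documents(rows))
```

Tokenizing and stop word filtering would then run on the returned plain text, inside the same per-document loop.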

Happy Mining!
~Marius