Web Mining - Web Page Similarity

Question

Hello,

I am a beginner with RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.

I want to find similarities among some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. I then use the "Data to Documents" and "Process Documents" operators. I then tokenize the web pages, filter stopwords and transform cases. Finally, I use the "Data to Similarity" operator.
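To make the goal concrete, here is a rough Python sketch of the same idea outside RapidMiner: lowercase and tokenize two texts, drop stopwords, and compare them with cosine similarity over raw term counts. The function names and the tiny stopword list are my own illustration, and raw term counts are an assumption; this is not what the RapidMiner operators do internally.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "on"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(page_a, page_b)` for every pair of pages would give a similarity table roughly like the one I am after.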

However, I notice that my results contain a lot of HTML tokens, which I do not want. I know that the "Extract Content" operator can strip away the HTML markup, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means that I am unable to strip the HTML if I want to fetch multiple pages at once using the "Get Pages" operator.
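For illustration, this is the kind of cleanup I would like to apply to every fetched page before tokenizing. It is only a minimal sketch using Python's standard `html.parser` module; `TextExtractor` and `strip_html` are hypothetical names of my own, and the sketch is not meant to reproduce what "Extract Content" does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())

def strip_html(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With a list of page sources, `[strip_html(page) for page in pages]` would give the plain text of every page at once, which is the step I cannot find for "Get Pages".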

Could somebody advise on how to do this? I will be really thankful!

Have a good day!

- Prat