Web Mining - Web Page Similarity

Prat_1
New Altair Community Member
Updated by Jocelyn
Hello,

I am new to RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.

I want to find similarities between some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. I then use the "Data to Documents" and "Process Documents" operators. I then tokenize the web pages, filter stopwords and transform cases. Finally, I use the "Data to Similarity" operator.

However, I notice that my results contain a lot of HTML tokens, which I do not want. I know that the "Extract Content" operator can strip away the HTML, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means that I am unable to strip the HTML if I fetch multiple pages at once with "Get Pages".

Could somebody advise on how to do this? I will be really thankful!

Have a good day!

- Prat 

    Hi, Extract Content expects a Document object on its input, which is what Get Page delivers. Get Pages, however, delivers an example set, which you pass into the Process Documents from Data operator. Inside that operator, you again have a Document object for each row and should be able to apply Extract Content there, before tokenizing.

    Happy Mining!
    ~Marius
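
For readers who want to see the idea outside the RapidMiner GUI, here is a rough Python sketch of the same order of steps: strip the HTML from each page individually, then tokenize, lowercase, filter stopwords, and compare. It is not how the RapidMiner operators are implemented. The function names, the tiny stopword list, the sample pages, and the choice of cosine similarity over raw term counts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Per-document step, analogous in spirit to applying Extract Content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokens(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical pages standing in for the fetched results.
pages = {
    "page1": "<html><body><h1>Data mining</h1><p>Mining web pages</p></body></html>",
    "page2": "<html><body><p>Web mining of pages</p><script>var x = 1;</script></body></html>",
}

# The HTML is stripped per page *before* tokenizing, so no markup tokens survive.
docs = {name: tokens(strip_html(html)) for name, html in pages.items()}
print(cosine(docs["page1"], docs["page2"]))
```

The point to carry back to RapidMiner is the ordering: the HTML removal runs once per document, ahead of tokenization, which is why Extract Content belongs inside the Process Documents from Data subprocess.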