Web Mining - Web Page Similarity
Prat_1
Hello,
I am a beginner with RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.
I want to find similarities between some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. I then use the "Data to Documents" and "Process Documents" operators. Inside, I tokenize the web pages, filter stopwords, and transform cases. Finally, I use the "Data to Similarity" operator.
However, I notice that my results contain a lot of HTML tokens, which I do not want. I know that the "Extract Content" operator can strip away the HTML, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means I am unable to strip the HTML if I fetch multiple pages at once using "Get Pages".
Could somebody advise on how to do this? I will be really thankful!
Have a good day!
- Prat
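To illustrate why the leftover HTML tokens matter, here is a minimal Python sketch of the same idea outside RapidMiner. It is not what the operators do internally: the tiny stopword list, the plain term counts, and the cosine measure are all assumptions chosen for illustration, and `tokenize` and `cosine_similarity` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two pages with unrelated text but identical markup.
page_a = '<div class="post"><p>Mining the web</p></div>'
page_b = '<div class="post"><p>Cooking pasta</p></div>'

tokens_a = tokenize(page_a)
tokens_b = tokenize(page_b)

# Markup names such as "div" and "class" end up as tokens,
# so the two pages look far more alike than their text is.
print(tokens_a)
print(cosine_similarity(tokens_a, tokens_b))
```

Because the shared tag and attribute names dominate the counts, the similarity between two pages with no words in common comes out high, which is the effect described above.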
MariusHelf
Hi, Extract Content expects a Document object on its input, which is what Get Page delivers. Get Pages, however, delivers an example set, which you pass into the Process Documents from Data operator. Inside that operator you are working with Document objects again, so you should be able to apply Extract Content there.
Happy Mining!
~Marius
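The same pattern can be sketched in Python for readers who want to see it outside RapidMiner: the pages arrive as rows of a table, and the HTML is stripped per document inside the processing step. This is only a rough analogy under stated assumptions. The `rows` list and its `link` and `html` keys are hypothetical stand-ins for the example set, and `strip_html` only removes tags, which is simpler than whatever Extract Content does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def strip_html(html):
    """Return the text content of one HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

# Hypothetical stand-in for the table of fetched pages, one row per page.
rows = [
    {"link": "http://example.com/a",
     "html": "<html><body><p>Mining the web</p></body></html>"},
    {"link": "http://example.com/b",
     "html": "<div><script>var x = 1;</script><p>Cooking pasta</p></div>"},
]

# Strip the markup once per document, before any tokenizing.
documents = [strip_html(row["html"]) for row in rows]
print(documents)
```

The point of the design is ordering: the markup is removed per document first, and tokenizing, stopword filtering, and case transformation only ever see the extracted text.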