"Mining online reviews for sentiment analysis"
janjan
New Altair Community Member
I am trying to capture reviews about a specific product from Amazon in order to do sentiment analysis, applying a classification model to predict positive or negative attitudes. Two questions:
1) Regarding getting the data: How do you limit the crawl to just the reviews? The reviews for the product are several pages long, and each page link looks like this:
http://www.amazon.com/Rainbow-Loom-Twistz-Bandz/product-reviews/B00DMC6KAC/ref=cm_cr_pr_btm_link_2?ie=UTF8&pageNumber=2&showViewpoints=0&sortBy=byRankDescending
...with the pageNumber value in the link changing based on the page number, of course. I want to crawl just these pages, but each review page has tons of other links, e.g. to amazon.com, to online ads, etc. Is there a character (like *) that I can use in place of the page number to specify that I want to crawl only these links?
2) How can I get each individual review (there are several on a page) into its own text document (or maybe its own field in a database record) so it can be classified?
Answers
Hi,
I suppose you are using the Crawl Web operator to crawl the pages. That operator supports regular expressions in the crawling rules. You'll find tons of documentation for regular expressions on the web. The wildcard for an arbitrary number of digits is \d+ (\d = one digit, + means one or more of them).
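As a rough illustration of the \d+ idea outside RapidMiner, here is a small Python sketch that tests URLs against such a pattern. The pattern itself, the `is_review_page` helper, and the assumption that a crawling rule has to match the whole URL are mine, not taken from the Crawl Web documentation, so treat it as a starting point to adapt.

```python
import re

# Pattern for the review pages of the product from the question.
# Only the pageNumber value varies; \d+ matches one or more digits.
# The ".*" parts are an assumption: they loosely match the rest of the URL.
REVIEW_PAGE = re.compile(r".*/product-reviews/B00DMC6KAC/.*pageNumber=\d+.*")


def is_review_page(url: str) -> bool:
    """Return True if the whole URL matches the review-page pattern."""
    return REVIEW_PAGE.fullmatch(url) is not None


urls = [
    "http://www.amazon.com/Rainbow-Loom-Twistz-Bandz/product-reviews/B00DMC6KAC/"
    "ref=cm_cr_pr_btm_link_2?ie=UTF8&pageNumber=2&showViewpoints=0&sortBy=byRankDescending",
    "http://www.amazon.com/",
]
matches = [is_review_page(u) for u in urls]
```

The same expression (without the Python wrapper) is the kind of thing you would try as a crawling rule; links to ads or to the Amazon home page do not contain the product-reviews path and are skipped.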
To split the reviews, one option would be to use Process Documents on the crawled pages, and use Split Documents to split each complete page into single reviews.
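To show what splitting a page into single reviews amounts to, here is a minimal Python sketch. The HTML snippet and the `<div class="review">` marker are hypothetical; the real Amazon markup is different and you would have to inspect the page source to find the element that wraps each review.

```python
import re
from typing import List

# Hypothetical page content; the real markup must be looked up in the page source.
page = """
<div class="review">Great loom, my kids love it.</div>
<div class="review">Bands snapped after a day.</div>
<div class="ad">Buy more stuff</div>
"""

# Non-greedy capture of the text between the assumed opening and closing tags.
REVIEW_BLOCK = re.compile(r'<div class="review">(.*?)</div>', re.DOTALL)


def split_reviews(html: str) -> List[str]:
    """Return one string per review block found in the page."""
    return [text.strip() for text in REVIEW_BLOCK.findall(html)]


reviews = split_reviews(page)
```

Each entry in `reviews` can then be treated as its own document for classification. A plain regular expression like this breaks down if the review element contains nested tags of the same kind, so a proper HTML parser may be the safer choice for real pages.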
Best regards,
Marius
Hi Marius
I want to get last year's news from the web using the Crawl Web operator. The crawl only gives me results from the past few months, even when I increase the depth. Can you guide me on how to refine my search to get the best historical data from websites?
Thanks
Sourabh Choudhary
Hi Sourabh,
That depends completely on the websites: you have to define the correct crawling rules, maybe combined with filters on the retrieved documents afterwards.
Unfortunately there is no general rule; you really have to look into the structure of the websites.
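Just to make the idea of a filter on the retrieved documents concrete, here is a small Python sketch. It assumes a site that puts the publication year into the URL path (for example /2013/05/...), which many news sites do but by no means all; the example.com URLs and the `filter_by_year` helper are made up for illustration.

```python
import re
from typing import List


def filter_by_year(urls: List[str], year: int) -> List[str]:
    """Keep URLs that contain the year as a full path segment, e.g. .../2013/..."""
    pattern = re.compile("/" + str(year) + "/")
    return [u for u in urls if pattern.search(u)]


# Hypothetical crawl result.
crawled = [
    "http://news.example.com/2013/05/some-story",
    "http://news.example.com/2014/01/other-story",
    "http://news.example.com/about",
]
last_year = filter_by_year(crawled, 2013)
```

If the site does not expose the date in the URL, the filter would have to look at a date inside the document instead, which again depends on the structure of that particular site.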
Best regards,
Marius
Hi Marius,
Thanks for your suggestions. I am trying out combinations of filters with crawling rules. As soon as I am able to do exactly what I want, I will share it on the forum.
Regards
Sourabh
Hi Marius
I want to search the web (social media and forums, blogs, search engines, news websites, news blogs, etc.) for relevant, valuable information about a specific keyword or name using RapidMiner. Could you please help me with how to do this?
Thanks
Sourabh