"Extract Data from HTML pages with loops"

User: "Kausty88"
New Altair Community Member
Updated by Jocelyn
Hi Team,

I have been a big fan of RapidMiner and I am trying to explore more and more of this tool. Today, I wanted to scrape data from a review site. The task has two parts:
1. Download the pages from the site
2. Crawl through each page to extract the data

I am able to do the first part and download the page from the URL: http://www.reevoo.com/p/acer-aspire-v5-431-987b4g50mass/page/1

Then I want to capture data from each review. I was able to get the XPath expressions from Google Docs, exactly as explained in http://www.youtube.com/watch?v=vKW5yd1eUpA

I want my process to loop not only through multiple files but also through each file itself, once per review. One file has approximately 8 reviews, and I want to loop through this file as well as 7 other files, so about 64 reviews in all. I am using "Process document from file" --> "Extract Information"

Settings for "Process document from file"

File from a list of directories, file pattern - *, use file extension, add metadata information

Settings for "Extract Information"

Query type - XPath, Attribute type - nominal, XPath queries as below, namespace - nothing, Ignore CDATA and Assume HTML - checked

But when I use these settings in the tool, I am not able to configure the process correctly for some reason, and it fails. Can anyone please advise me here?

Here are my XPath queries in the Extract Information operator:

1. //h:*[@class="review comment "]/h:div/h:h4/span (User)
2. //h:*[@class="review comment "]/h:div/h:h4/a (User_Type)
3. //h:*[@class="review comment "]/h:div/h:span/h:span/span[1] (Ratings)
4. //h:*[@class="review comment "]/h:div/h:dl/dd[1] (Pros)
5. //h:*[@class="review comment "]/h:div/h:dl/dd[2] (Cons)
6. //h:*[@class="review comment "]/h:div/h:p/span (Purchase_Date)
7. //h:*[@class="review comment "]/h:div/h:div/span (Review_Helpful)
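One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the last step of each query above (span, a, dd) has no h: prefix, while the earlier steps do. If the page is read as XHTML, an unprefixed name in XPath 1.0 only matches elements that are in no namespace, so those steps would match nothing. The Python snippet below uses only the standard library to illustrate both that effect and the nested loop (outer loop over files, inner loop over reviews, with queries evaluated relative to each review). SAMPLE_PAGE, extract_reviews and collect are hypothetical names, and the markup is a made-up simplification, not the real reevoo.com HTML.

```python
import xml.etree.ElementTree as ET

# XHTML namespace bound to the "h" prefix, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical, simplified stand-in for one downloaded review page.
SAMPLE_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="review comment "><div>
  <h4><span>Alice</span><a>Verified buyer</a></h4>
  <dl><dd>Light</dd><dd>Weak battery</dd></dl>
</div></div>
<div class="review comment "><div>
  <h4><span>Bob</span><a>Verified buyer</a></h4>
  <dl><dd>Fast</dd><dd>Noisy fan</dd></dl>
</div></div>
</body></html>"""

# Paths are relative to one review node; every step carries the h: prefix.
FIELDS = {
    "User": "h:div/h:h4/h:span",
    "User_Type": "h:div/h:h4/h:a",
    "Pros": "h:div/h:dl/h:dd[1]",
    "Cons": "h:div/h:dl/h:dd[2]",
}


def extract_reviews(page_text):
    """Inner loop: return one dict per review found in a single page."""
    root = ET.fromstring(page_text)
    rows = []
    # The sample wraps each review in a div; the class keeps its trailing space.
    for review in root.findall(".//h:div[@class='review comment ']", NS):
        rows.append({name: review.findtext(path, namespaces=NS)
                     for name, path in FIELDS.items()})
    return rows


def collect(pages):
    """Outer loop: gather the reviews of every page into one list."""
    rows = []
    for page_text in pages:
        rows.extend(extract_reviews(page_text))
    return rows


# Two copies of the sample stand in for two downloaded files.
all_rows = collect([SAMPLE_PAGE, SAMPLE_PAGE])

# Dropping the prefix on the last step finds nothing in namespaced markup.
first_review = ET.fromstring(SAMPLE_PAGE).find(
    ".//h:div[@class='review comment ']", NS)
unprefixed = first_review.findtext("h:div/h:h4/span", namespaces=NS)
```

Here all_rows holds one row per review across both pages, and unprefixed is None because "span" without h: does not match the namespaced element. Real downloaded pages are rarely well-formed XML, so this standard-library parser is only suitable for the simplified sample; the point is the query structure, not the parser.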
