"Extract Data from HTML pages with loops"
Kausty88
New Altair Community Member
Hi Team,
I have been a big fan of RapidMiner and keep trying to explore this tool further. Today, I wanted to scrape data from a review site:
1. Download the pages from the site
2. Crawl through each page to extract the data
I am able to do the first part and can download the pages from the URL: http://www.reevoo.com/p/acer-aspire-v5-431-987b4g50mass/page/1
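Outside RapidMiner, the download step can be reproduced with a short Python sketch. The URL pattern is the one from the post; the helper names (`page_urls`, `download_pages`) and the output folder are hypothetical, and the page count of 8 is an assumption based on the description below.

```python
from urllib.request import urlopen
from pathlib import Path

# URL pattern from the post; {} is the 1-based page number.
BASE = "http://www.reevoo.com/p/acer-aspire-v5-431-987b4g50mass/page/{}"

def page_urls(n_pages=8):
    # Build the URLs for pages 1..n_pages of the review listing.
    return [BASE.format(i) for i in range(1, n_pages + 1)]

def download_pages(out_dir="pages", n_pages=8):
    # Save each page as pages/page_1.html, pages/page_2.html, ...,
    # so a "loop over files" step can pick them up afterwards.
    Path(out_dir).mkdir(exist_ok=True)
    for i, url in enumerate(page_urls(n_pages), start=1):
        Path(out_dir, f"page_{i}.html").write_bytes(urlopen(url).read())
```

Calling `download_pages()` would fetch all eight pages into a local folder, which matches the directory-based setup described further down.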
Then I want to capture data from each review. I was able to capture the XPath via Google Docs, exactly as explained in http://www.youtube.com/watch?v=vKW5yd1eUpA
I want my process to loop not only over multiple files but also within each file for multiple reviews. One file holds approximately 8 reviews, and I want to loop through this file as well as 7 other files, so 64 reviews in all. I am using "Process Documents from Files" --> "Extract Information".
Settings for "Process Documents from Files":
file from a list of directories, file pattern: *, use file extension, add metadata information
Settings for "Extract Information":
query type: XPath, attribute type: nominal, XPath queries as below, namespace: none, Ignore CDATA and Assume HTML: checked
But when I use these in the tool, the configuration fails for some reason. Can anyone please advise me here?
Here are my XPath queries in the Extract Information operator:
1. //h:*[@class="review comment "]/h:div/h:h4/span (User)
2. //h:*[@class="review comment "]/h:div/h:h4/a (User_Type)
3. //h:*[@class="review comment "]/h:div/h:span/h:span/span[1] (Ratings)
4. //h:*[@class="review comment "]/h:div/h:dl/dd[1] (Pros)
5. //h:*[@class="review comment "]/h:div/h:dl/dd[2] (Cons)
6. //h:*[@class="review comment "]/h:div/h:p/span (Purchase_Date)
7. //h:*[@class="review comment "]/h:div/h:div/span (Review_Helpful)
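To show what these queries are meant to pull out, here is a plain-Python sketch of the per-review loop using the standard library's ElementTree. The `h:` prefixes are dropped because this sketch does not bind the XHTML namespace the way RapidMiner does, and `SAMPLE_PAGE` is a made-up stand-in for a downloaded page (real Reevoo markup will differ); the class name with its trailing space is copied from the queries above.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of one downloaded review page, shaped to match
# the XPath queries above (note the trailing space in the class value).
SAMPLE_PAGE = """
<html><body>
  <div class="review comment ">
    <div>
      <h4><span>Alice</span> <a>Verified buyer</a></h4>
      <span><span><span>9</span></span></span>
      <dl><dd>Fast boot</dd><dd>Small screen</dd></dl>
      <p><span>June 2013</span></p>
      <div><span>5 of 6 found this helpful</span></div>
    </div>
  </div>
</body></html>
"""

def first_text(block, path):
    # Return the text of the first node matching a relative path, or None.
    node = block.find(path)
    return node.text.strip() if node is not None and node.text else None

tree = ET.fromstring(SAMPLE_PAGE)
reviews = []
# Outer loop: one iteration per review block on the page.
# Inner queries: relative paths mirroring queries 1-7 above.
for block in tree.findall('.//*[@class="review comment "]'):
    reviews.append({
        "User": first_text(block, "div/h4/span"),
        "User_Type": first_text(block, "div/h4/a"),
        "Ratings": first_text(block, "div/span/span/span[1]"),
        "Pros": first_text(block, "div/dl/dd[1]"),
        "Cons": first_text(block, "div/dl/dd[2]"),
        "Purchase_Date": first_text(block, "div/p/span"),
        "Review_Helpful": first_text(block, "div/div/span"),
    })

print(reviews)
```

The key point the sketch illustrates: matching the whole review block first and then running relative queries inside it yields one row per review, rather than one concatenated value per page.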