"[Solved]Crawling rules"
ArnoG
New Altair Community Member
I'm trying to crawl a hotel booking site. I want to crawl the reviews, for example at this URL: http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS
I use Crawl Web as an operator but I don't get any output.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
<parameter key="url" value="http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"/>
<parameter key="follow_link_with_matching_url" value=".+Reviews-or10.+"/>
</list>
<parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
<parameter key="extension" value="html"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Can anybody tell me what I'm doing wrong?
Thanks, Arno
Answers
Arno,
your rule for storing is missing the -or10 part. Use copy and paste next time:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
<parameter key="url" value="http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+Reviews\-or10\-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"/>
<parameter key="follow_link_with_matching_url" value=".+Reviews\-or10.+"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
<parameter key="extension" value="html"/>
<parameter key="max_pages" value="5"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
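To see why the original store rule produced no output, here is a minimal Python sketch that checks the rules from this thread against a paginated review URL. It assumes the crawler requires a rule to match the whole URL, and the page URL below is not quoted in the thread: it is inferred from the corrected rule. Python's `re` module stands in for the regex engine RapidMiner uses, so treat this as an illustration rather than a description of the operator's internals.

```python
import re

# Rules copied from the two processes in this thread.
ORIGINAL_STORE_RULE = r".+Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"
CORRECTED_STORE_RULE = r".+Reviews\-or10\-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"
FOLLOW_RULE = r".+Reviews\-or10.+"

# Assumed URL of the second review page (inferred from the corrected rule,
# not quoted in the thread).
PAGE_2_URL = (
    "http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-or10-"
    "EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html"
)


def rule_matches(rule: str, url: str) -> bool:
    """Return True if the rule matches the entire URL."""
    return re.fullmatch(rule, url) is not None


for name, rule in [
    ("original store rule", ORIGINAL_STORE_RULE),
    ("corrected store rule", CORRECTED_STORE_RULE),
    ("follow rule", FOLLOW_RULE),
]:
    print(f"{name}: {rule_matches(rule, PAGE_2_URL)}")
```

Under these assumptions the follow rule matches the page while the original store rule does not, because the literal `Reviews-EasyHotel` never occurs once `-or10-` sits in between. The page would be visited but never stored.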
Hi Marius, thanks a lot. You are right. Next time I will copy and paste.
Arno
Hi Marius,
You helped me a great deal with crawling this url: http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS
Now I have created an XPath to retrieve the reviews. The XPath works in Google Docs but not in RapidMiner. The reason is that I have to crawl the following URL:
http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#
They lead to the same reviews. I would like to use RapidMiner to follow the pages. The only thing that changes when going to the next page is the review id, for example -r155685828. The URL of the next page is the same except for the r number, which has changed to r162587896.
My process is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
<parameter key="url" value="http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS.+"/>
<parameter key="follow_link_with_matching_url" value=".+r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS.+"/>
</list>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
<parameter key="extension" value="html"/>
<parameter key="max_pages" value="5"/>
<parameter key="max_page_size" value="10000"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Can you help me once again?
Thanks, Arno
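One possible reading of the problem, sketched in Python under the assumption that a rule must match the whole URL: both posted rules pin the review id r155685828 and end in `#REVIEWS.+`, which demands at least one character after `#REVIEWS`. The `GENERAL_RULE` below is hypothetical and not from the thread. It accepts any numeric review id and requires nothing after `.html`. The second URL is built from the description above (same URL, different r number).

```python
import re

# Rule exactly as posted in the process above (used for both store and follow).
POSTED_RULE = (
    r".+r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province"
    r".html#REVIEWS.+"
)

# Hypothetical generalised rule: any numeric review id, nothing required
# after ".html".
GENERAL_RULE = (
    r".+ShowUserReviews-g188590-d2333086-r\d+-"
    r"EasyHotel_Amsterdam-Amsterdam_North_Holland_Province\.html.*"
)

BASE = "http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-"
TAIL = "-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html"

URLS = {
    "start URL with fragment": BASE + "r155685828" + TAIL + "#REVIEWS",
    "next review page": BASE + "r162587896" + TAIL,
}


def rule_matches(rule: str, url: str) -> bool:
    """Return True if the rule matches the entire URL."""
    return re.fullmatch(rule, url) is not None


for label, url in URLS.items():
    print(
        f"{label}: posted={rule_matches(POSTED_RULE, url)} "
        f"general={rule_matches(GENERAL_RULE, url)}"
    )
```

In this sketch the posted rule matches neither URL, not even the start URL, while the generalised rule matches both. Whether the crawler keeps the `#REVIEWS` fragment when comparing URLs is not stated in the thread, so the trailing `.*` is there to tolerate either case.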