"[Solved] Crawling rules"

ArnoG
ArnoG New Altair Community Member
edited November 5 in Community Q&A
I'm trying to crawl a hotel booking site. I want to crawl the reviews, for example from this URL: http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS

I use the Crawl Web operator, but I don't get any output.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="75">
       <parameter key="url" value="http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
       <list key="crawling_rules">
         <parameter key="store_with_matching_url" value=".+Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"/>
         <parameter key="follow_link_with_matching_url" value=".+Reviews-or10.+"/>
       </list>
       <parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
       <parameter key="extension" value="html"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>


Can anybody tell me what I'm doing wrong?

Thanks, Arno

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Arno,

    Your store rule is missing the -or10. Use copy and paste next time :)
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
            <parameter key="url" value="http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+Reviews\-or10\-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"/>
              <parameter key="follow_link_with_matching_url" value=".+Reviews\-or10.+"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="5"/>
            <parameter key="max_page_size" value="10000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
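
    To make the difference easier to spot, here are just the crawling rules on their own. This is a sketch, assuming the paginated review URLs carry `-or10-` directly after `Reviews`, as the corrected pattern above implies. The comment shows the original store pattern from the question. It expects `Reviews-EasyHotel_Amsterdam` with nothing in between, so it cannot match a URL that has `-or10-` at that position.

    ```xml
    <list key="crawling_rules">
      <!-- original store rule, without the or10 part:
           .+Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland.+ -->
      <!-- corrected store rule: keeps pages whose URL contains Reviews-or10-EasyHotel_Amsterdam -->
      <parameter key="store_with_matching_url" value=".+Reviews\-or10\-EasyHotel_Amsterdam-Amsterdam_North_Holland.+"/>
      <!-- follow rule: unchanged, follows links whose URL contains Reviews-or10 -->
      <parameter key="follow_link_with_matching_url" value=".+Reviews\-or10.+"/>
    </list>
    ```

    With this pair, the store rule accepts the same kind of URL that the follow rule sends the crawler to.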
  • ArnoG
    ArnoG New Altair Community Member
    Hi Marius, thanks a lot. You are right. Next time I'll copy and paste :)

    Arno
  • ArnoG
    ArnoG New Altair Community Member
    Hi Marius,

    You helped me a great deal with crawling this url: http://www.tripadvisor.nl/Hotel_Review-g188590-d2333086-Reviews-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS

    Now I have created an XPath to retrieve the reviews. The XPath works in Google Docs but not in RapidMiner. The reason is that I have to crawl the following URL:
    http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#

    Both URLs lead to the same reviews. I would like to use RapidMiner to follow the pages. The only thing that changes when going to the next page is the review number, for example -r155685828. The URL of the next page is the same except for the r number, which has changed to r162587896.

    My process is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
            <parameter key="url" value="http://www.tripadvisor.nl/ShowUserReviews-g188590-d2333086-r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS.+"/>
              <parameter key="follow_link_with_matching_url" value=".+r155685828-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.html#REVIEWS.+"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawl"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="5"/>
            <parameter key="max_page_size" value="10000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Can you help me once more?

    Thanks, Arno
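
    Two things stand out in the rules above. Both patterns contain the fixed number r155685828, so a next page with r162587896 cannot match them. Both also end in `#REVIEWS.+`, which requires at least one more character after `#REVIEWS`, and the start URL has none. One possible direction is sketched below. It is untested and assumes that the operator uses Java-style regular expressions, that the links to the next review pages are present in the page HTML, and that the g and d numbers stay the same for this hotel.

    ```xml
    <list key="crawling_rules">
      <!-- r\d+ means the letter r followed by one or more digits,
           so r155685828 and r162587896 both match -->
      <parameter key="store_with_matching_url" value=".+ShowUserReviews-g188590-d2333086-r\d+-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.+"/>
      <parameter key="follow_link_with_matching_url" value=".+ShowUserReviews-g188590-d2333086-r\d+-EasyHotel_Amsterdam-Amsterdam_North_Holland_Province.+"/>
    </list>
    ```

    The trailing `.+` still covers `.html` and anything after it, so the pattern does not depend on the `#REVIEWS` fragment.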