"WEB crawler rules"

Unknown
Hi!

I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm a complete beginner with RapidMiner.
So I've started using the Crawl Web operator.

I'm using it to crawl a Slovenian real estate webpage, and I'm having trouble setting the web crawler rules.

I know that two rules are important: which URLs to follow and which to store.

I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html?id=<something>.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280

What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+

Any help would be appreciated!

U.

Answers

  • MariusHelf
    Hey U,

    On a quick check I got some pages with the following settings:
    url: http://www.realestate-slovenia.info/
    both rules: .+id.+

    And I also increased the max page size to 10000.
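
    In the operator XML, those settings correspond to parameters like the ones below (only the relevant keys shown; this is a sketch from memory, so double-check the exact keys in your version):

    <parameter key="url" value="http://www.realestate-slovenia.info/"/>
    <list key="crawling_rules">
      <!-- the same expression for both rules: follow and store everything with "id" in the URL -->
      <parameter key="follow_link_with_matching_url" value=".+id.+"/>
      <parameter key="store_with_matching_url" value=".+id.+"/>
    </list>
    <parameter key="max_page_size" value="10000"/>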

    As always, I have to ask: did you check that the site's policy/copyright notice allows you to machine-crawl it?

    Best regards,
    Marius
  • Marius,

    The web page allows robots.

    Your example stores only the real estate ads on the first page; the web crawler doesn't go on to the second, third, ... page.

    Thanks for helping.
  • MariusHelf
    Then you probably have to increase the max_depth and adapt your rules. Please note that you should not add more than one follow rule; instead, add all expressions to a single rule, separated by a vertical bar, as you did in your first post.
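
    For example, with the expressions from your first post combined into one follow rule, the crawling_rules list would look like this (a sketch; pick a max_depth high enough to reach the later result pages):

    <list key="crawling_rules">
      <!-- one follow rule: result pages (pg) and single ads (id), combined with | -->
      <parameter key="follow_link_with_matching_url" value=".+pg.+|.+id.+"/>
      <!-- store only the single-ad pages -->
      <parameter key="store_with_matching_url" value=".+id.+"/>
    </list>
    <parameter key="max_depth" value="10"/>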

    Best regards,
    Marius
  • Marius,

    I put the web crawler problem aside for a while. Today I started to deal with it again. I still have a problem with the crawling rules; all the other web crawler attributes are clear.

    This is my Web crawler process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
            <parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&amp;]pg=.+ | id=.+)"/>
              <parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
            </list>
            <parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="domain" value="server"/>
            <parameter key="max_page_size" value="10000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    As you can see, I try to follow three types of URLs, for example:

    http://www.realestate-slovenia.info/nepremicnine.html?q=sale
    http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
    http://www.realestate-slovenia.info/nepremicnine.html?id=5744923

    And I want to store only one type of URL:

    http://www.realestate-slovenia.info/nepremicnine.html?id=5469846

    So for the first task my rule is:

    http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)

    For the second task the rule is:
    http://www.realestate-slovenia.info/nepremicnine.html?id=.+

    The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
    .+pg.+ | .+id.+ for the first task and .+id.+ for the second, but the latter stores many pages that are not my focus.
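
    One thing I am not sure about: if the rules are matched as regular expressions, then the ? after html and the dots in the URL are metacharacters, and the spaces inside my parentheses would have to appear literally in the links. An escaped version of my two rules might look like this (just a guess on my part, untested):

    <list key="crawling_rules">
      <!-- literal dots and the ? escaped; no spaces around the | alternatives -->
      <parameter key="follow_link_with_matching_url" value="http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&amp;pg=.+|id=.+)"/>
      <parameter key="store_with_matching_url" value="http://www\.realestate-slovenia\.info/nepremicnine\.html\?id=.+"/>
    </list>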

    I would really like this process to work, because the gathered data are the basis for my article.

    Thanks.