"WEB crawler rules"

phdslovenia New Altair Community Member
edited November 2024 in Community Q&A
Hi!

I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm a complete beginner in RapidMiner.
So I've started to use the Web Crawler operator.

I'm using it on a Slovenian real estate webpage, and I'm having trouble setting the web crawler rules.

I know that there are two important rules: which URLs to follow and which URLs to store.

I would like to store URLs of the form "http://www.realestate-slovenia.info/nepremicnine.html" + id=something.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280
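
If the store rule is a regular expression, my first guess would be something like this (untested, and I'm not sure whether the ? and the dots need escaping):

    .*nepremicnine\.html\?id=.*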

And what about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+

Any help would be appreciated!

U.

Answers

  • MariusHelf New Altair Community Member
    Hey U,

    On a quick check, I got some pages with the following settings:
    url: http://www.realestate-slovenia.info/
    both rules: .+id.+

    And I also increased the max page size to 10000.
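
    In the operator XML, those settings would look roughly like this (the parameter keys are the ones the Crawl Web operator uses; the output directory and the remaining parameters are omitted):

        <operator activated="true" class="web:crawl_web" name="Crawl Web">
          <parameter key="url" value="http://www.realestate-slovenia.info/"/>
          <list key="crawling_rules">
            <parameter key="follow_link_with_matching_url" value=".+id.+"/>
            <parameter key="store_with_matching_url" value=".+id.+"/>
          </list>
          <parameter key="max_page_size" value="10000"/>
        </operator>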

    As always, I have to ask: did you check that the site's policy/copyright notice allows you to machine-crawl the page?

    Best regards,
    Marius
  • phdslovenia New Altair Community Member
    Marius,

    The web page allows robots.

    Your example stores only the real estate ads from the first page; the web crawler doesn't go on to the second, third, ... page.

    Thanks for helping.
  • MariusHelf New Altair Community Member
    Then you probably have to increase max_depth and adapt your rules. Please note that you should not add more than one follow rule; instead, put all expressions into one single rule, separated by a vertical bar, as you did in your first post.
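
    For example, a single follow rule covering both the page links and the ad links could look like the pattern below. Note that there must be no spaces around the vertical bar, because a space in a regular expression matches a literal space:

        .+pg.+|.+id.+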

    Best regards,
    Marius
  • phdslovenia New Altair Community Member
    Marius,

    I put the web crawler problem aside for a while. Today I started working on it again. I still have a problem with the crawling rules; all the other web crawler attributes are clear.

    This is my Web crawler process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
            <parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&amp;]pg=.+ | id=.+)"/>
              <parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
            </list>
            <parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="domain" value="server"/>
            <parameter key="max_page_size" value="10000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    As you can see, I try to follow three types of URL, for example:

    http://www.realestate-slovenia.info/nepremicnine.html?q=sale
    http://www.realestate-slovenia.info/nepremicnine.html?q=sale&;pg=6
    http://www.realestate-slovenia.info/nepremicnine.html?id=5744923

    And I want to store only one type of URL:

    http://www.realestate-slovenia.info/nepremicnine.html?id=5469846

    So for the first task my rule is:

    http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)

    For the second task the rule is:
    http://www.realestate-slovenia.info/nepremicnine.html?id=.+

    The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
    .+pg.+ | .+id.+ for the first task and .+id.+ for the second, but the latter returns many pages that are not my focus.
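
    Could it be that the ? and the dots are regular expression metacharacters, and that the spaces around the vertical bars match literal spaces? An escaped variant without spaces would then look like this (untested):

        http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&pg=.+|id=.+)

    and for the store rule:

        http://www\.realestate-slovenia\.info/nepremicnine\.html\?id=.+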

    I would really like this process to work, because the gathered data are the basis for my article.

    Thanks.