"Web Crawler Crawling Rules [SOLVED]"

New Altair Community Member

Nov 30, 2012

Updated Nov 5, 2024 by Jocelyn

I don't understand how the web crawling rules work. I've been trying to scrape a particular site, pulling a set of listings from it in order to parse them, but getting the regular expressions/rules to work has been challenging.

The root of my search is something like the following:

http://www.mysite.com/browse/division

What I'm trying to do is pull down all the business site pages found on the site. These pages have the following format:

http://www.mysite.com/site/business-site-1

So... I am able to pull down all the pages with the following rules:

<parameter key="follow_link_with_matching_url" value=".*browse.*"/>
<parameter key="follow_link_with_matching_url" value=".*division.*"/>
<parameter key="follow_link_with_matching_url" value=".*browse/division.*"/>
<parameter key="follow_link_with_matching_url" value=".*site.*"/>
<parameter key="store_with_matching_url" value=".*site.*"/>

But the problem is that this casts too broad a net. I'm picking up links which have the following format: http://www.mysite.com/es/site/business-site-1. They're in Spanish so I don't want 'em, but I don't know how to exclude them. My latest attempt is the following:

<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/division.*"/>
<parameter key="follow_link_with_matching_url" value="/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/site/.*"/>
<parameter key="store_with_matching_url" value="/site/.*"/>

But this doesn't work. The actual links in the source are relative: /site/business-site-1. Is the RapidMiner crawler resolving these links to absolute form? I've also tried spelling out the absolute paths in the rules, like so:

<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/site/.*"/>

But this isn't working either. Is there something going on here with the order of the rules themselves? Are the rules OR'ed? I'm struggling a little here, and the regular expressions seem to work fine outside the Web Crawler context.
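
One way to check rules like these outside the crawler is a short Python sketch. It assumes (this thread does not confirm it) that the crawler resolves relative links against the page URL and that a rule has to match the whole absolute URL; `re.fullmatch` stands in for that behaviour here. The `matches_rule` helper is hypothetical, and the URLs are the placeholder ones from the question. It only tests the patterns themselves, not how the crawler combines several rules.

```python
import re
from urllib.parse import urljoin

# Assumption: the crawler resolves relative hrefs against the page URL
# before applying any rule. urljoin performs that resolution here.
PAGE = "http://www.mysite.com/browse/division"
english = urljoin(PAGE, "/site/business-site-1")
spanish = urljoin(PAGE, "/es/site/business-site-1")


def matches_rule(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the entire URL (hypothetical helper)."""
    return re.fullmatch(pattern, url) is not None


# ".*site.*" matches every URL on this host, because "mysite" contains "site".
broad = [matches_rule(".*site.*", u) for u in (PAGE, english, spanish)]

# "/site/.*" cannot match a whole absolute URL, which starts with "http".
relative_only = matches_rule("/site/.*", english)

# A pattern anchored at the host accepts the English page and rejects the Spanish one.
anchored = r"http://www\.mysite\.com/site/.*"
anchored_results = (matches_rule(anchored, english), matches_rule(anchored, spanish))

print(english)
print(broad, relative_only, anchored_results)
```

Under those assumptions, the broad rule is the reason the Spanish pages get picked up, and a rule that starts with `/site/` would never match a resolved link.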


MariusHelf

New Altair Community Member

Dec 3, 2012

Hi,

On rapid-i.com the process below works perfectly. Maybe you have to include the absolute URL in the store rule as well?

Best regards,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="480" width="779">
      <operator activated="true" class="web:crawl_web" compatibility="5.2.004" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
        <parameter key="url" value="http://rapid-i.com"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://rapid-i.com/content/view/.*/1/lang,en/"/>
          <parameter key="store_with_matching_url" value="http://rapid-i.com/content/view/.*/1/lang,en/"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="max_depth" value="5"/>
        <parameter key="delay" value="100"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Datadude

New Altair Community Member

Dec 3, 2012

Ok,

Finally figured this out. It looks like you can only have one rule of each type, although that isn't very clear from the interface. You can use the matching groups functionality to match alternative phrases in the URLs, which works well for my use case. I'm not even using the captured groups, but the group helps match up a "word" in the URL. Here are my two (and only two) revised rules:

<list key="crawling_rules">
  <parameter key="follow_link_with_matching_url" value="http://www.mysite.com/(browse/site|site).*"/>
  <parameter key="store_with_matching_url" value="http://www.mysite.com/(browse/site|site).*"/>
</list>
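
As a quick sanity check of the revised follow rule, here is a minimal Python sketch. It assumes a rule must match the whole absolute URL (not confirmed in this thread), so it uses `fullmatch`; the sample URLs are the placeholder ones from the original question.

```python
import re

# Follow rule copied from the post above (dots left unescaped, as posted).
RULE = re.compile(r"http://www.mysite.com/(browse/site|site).*")

# Sample URLs taken from the original question.
urls = [
    "http://www.mysite.com/site/business-site-1",
    "http://www.mysite.com/es/site/business-site-1",
]

# Assumption: a rule has to match the entire URL, hence fullmatch.
results = {url: RULE.fullmatch(url) is not None for url in urls}

for url, ok in results.items():
    print("follow" if ok else "skip", url)
```

Because the pattern is anchored at the host name, the `/es/` pages fall outside it without needing a separate exclusion rule.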

