"Web Crawler Crawling Rules [SOLVED]"

User: "Datadude"
New Altair Community Member
Updated by Jocelyn
I don't understand how the web crawling rules work.  I've been trying to scrape a particular site, pulling a set of listings from it in order to parse them, but getting the regular expressions/rules to work has been challenging.

The root of my search is something like the following:

http://www.mysite.com/browse/division

What I'm trying to do is pull down all the business site pages found on the site.  These pages have the following format:

http://www.mysite.com/site/business-site-1

So... I am able to pull down all the pages with the following rules:

         <parameter key="follow_link_with_matching_url" value=".*browse.*"/>
         <parameter key="follow_link_with_matching_url" value=".*division.*"/>
         <parameter key="follow_link_with_matching_url" value=".*browse/division.*"/>
         <parameter key="follow_link_with_matching_url" value=".*site.*"/>
         <parameter key="store_with_matching_url" value=".*site.*"/>

But the problem is that this casts too broad a net.  I'm picking up links which have the following format:  http://www.mysite.com/es/site/business-site-1.  They're in Spanish so I don't want 'em.  I don't know how to exclude them.  My latest attempt is the following:

         <parameter key="follow_link_with_matching_url" value="http://www.mysite.com/browse/division.*"/>
         <parameter key="follow_link_with_matching_url" value="/division.*"/>
         <parameter key="follow_link_with_matching_url" value="/browse/division.*"/>
         <parameter key="follow_link_with_matching_url" value="/site/.*"/>
         <parameter key="store_with_matching_url" value="/site/.*"/>

But this doesn't work.  The actual links in the source are relative:  /site/business-site-1.  Is the RapidMiner crawler resolving these links to absolute form?  I've also tried spelling out the absolute paths in the rules, like so:

<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/site/.*"/>

But this isn't working either.  Is there something going on here with the order of the rules themselves?  Are the rules OR'ed?  I'm struggling a little here, and the regular expressions seem to work fine outside the Web Crawler context.
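
To make the question concrete, here is a minimal Python sketch of the two things being asked about: resolving a relative link against the crawl root, and the difference between a pattern that must match the whole URL and one that only needs to match part of it. It assumes (this is not confirmed RapidMiner behavior) that the crawler resolves links to absolute form and requires each rule to match the entire URL. The helper names and the NO_ES and ENGLISH_ONLY patterns are hypothetical, chosen only for illustration.

```python
import re
from urllib.parse import urljoin

# Crawl root from the post; relative hrefs are resolved against it.
BASE = "http://www.mysite.com/browse/division"


def resolve(href, base=BASE):
    """Turn a relative link such as /site/x into an absolute URL."""
    return urljoin(base, href)


def whole_match(pattern, url):
    """True only if the pattern matches the entire URL (assumed crawler behavior)."""
    return re.fullmatch(pattern, url) is not None


def partial_match(pattern, url):
    """True if the pattern matches anywhere inside the URL."""
    return re.search(pattern, url) is not None


english = resolve("/site/business-site-1")
spanish = resolve("/es/site/business-site-1")

# Hypothetical rule: anchor on the host so /es/site/... cannot match.
ENGLISH_ONLY = r"http://www\.mysite\.com/site/.*"
# Hypothetical rule: keep the broad pattern but reject any URL containing /es/.
NO_ES = r"(?!.*/es/).*site.*"

if __name__ == "__main__":
    for pattern in (r"/site/.*", r".*site.*", ENGLISH_ONLY, NO_ES):
        print(pattern,
              "| english:", whole_match(pattern, english),
              "| spanish:", whole_match(pattern, spanish))
```

Under the whole-URL assumption, a rule like /site/.* can never match an absolute URL because the URL starts with http://, while the same pattern matches fine in a tool that searches for partial matches. That would be one way the expressions could behave differently inside and outside the crawler, but it needs to be checked against the operator's documentation.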
