"Web Crawler Crawling Rules [SOLVED]"
I don't understand how the web crawling rules work. I've been trying to scrape a particular site, pulling a set of listings from it in order to parse them, but getting the regular expressions/rules to work has been challenging.
The root of my search is something like the following:
http://www.mysite.com/browse/division
What I'm trying to do is pull down all the business site pages found on the site. These pages have the following format:
http://www.mysite.com/site/business-site-1
So... I am able to pull down all the pages with the following rules:
<parameter key="follow_link_with_matching_url" value=".*browse.*"/>
<parameter key="follow_link_with_matching_url" value=".*division.*"/>
<parameter key="follow_link_with_matching_url" value=".*browse/division.*"/>
<parameter key="follow_link_with_matching_url" value=".*site.*"/>
<parameter key="store_with_matching_url" value=".*site.*"/>
But the problem is that this casts too broad a net. I'm picking up links which have the following format: http://www.mysite.com/es/site/business-site-1. They're in Spanish, so I don't want 'em, but I don't know how to exclude them. My latest attempt is the following:
<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/division.*"/>
<parameter key="follow_link_with_matching_url" value="/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/site/.*"/>
<parameter key="store_with_matching_url" value="/site/.*"/>
But this doesn't work. The actual links in the page source are relative: /site/business-site-1. Is the RapidMiner crawler resolving these links to absolute form? I've also tried spelling out the absolute paths in the rules, like so:
<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/site/.*"/>
But this isn't working either. Is there something going on here with the order of the rules themselves? Are the rules OR'ed? I'm struggling a little here, and the regular expressions seem to work fine outside the Web Crawler context.
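For reference, here is a rough sketch of how these patterns behave outside the crawler, under two assumptions I have not confirmed: that each rule must match the whole resolved absolute URL (the way Java's `String.matches` does, emulated here with Python's `re.fullmatch`), and that several rules of the same kind are OR'ed. The helper names `rule_matches` and `any_rule_matches` are made up for illustration and are not part of RapidMiner. If those assumptions hold, `/site/.*` can never match an absolute URL, and `.*site.*` matches every page on the host because the hostname `www.mysite.com` itself contains "site".

```python
import re


def rule_matches(pattern: str, url: str) -> bool:
    """Whole-string match, assumed (not confirmed) to mirror how the crawler tests a rule."""
    return re.fullmatch(pattern, url) is not None


def any_rule_matches(patterns, url: str) -> bool:
    """Assumption: multiple rules with the same key are OR'ed together."""
    return any(rule_matches(p, url) for p in patterns)


root = "http://www.mysite.com/browse/division"
english = "http://www.mysite.com/site/business-site-1"
spanish = "http://www.mysite.com/es/site/business-site-1"

# Without a leading ".*", a path-only rule cannot match an absolute URL as a whole.
print(rule_matches("/site/.*", english))

# ".*site.*" matches everything on this host, since "mysite" contains "site".
print(rule_matches(".*site.*", root), rule_matches(".*site.*", spanish))

# Anchoring the rule at the host keeps /site/ pages and rejects /es/site/ pages.
anchored = r"http://www\.mysite\.com/site/.*"
print(rule_matches(anchored, english), rule_matches(anchored, spanish))

# Alternative: a negative lookahead that rejects any URL containing "/es/".
no_spanish = r"(?!.*/es/).*/site/.*"
print(rule_matches(no_spanish, english), rule_matches(no_spanish, spanish))

# OR'ed follow rules from the latest attempt: only the absolute one can match the root.
follow_rules = [r"http://www.mysite.com/browse/division.*", r"/division.*", r"/browse/division.*"]
print(any_rule_matches(follow_rules, root))
```

This only shows what the regular expressions do on their own. It doesn't tell me whether the crawler actually resolves relative links or applies the rules this way.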