"Web Crawler Crawling Rules [SOLVED]"
Datadude
New Altair Community Member
I don't understand how the web crawling rules work. I've been trying to scrape a particular site, pulling a set of listings from it in order to parse them, but getting the regular expressions/rules to work has been challenging.
The root of my search is something like the following:
http://www.mysite.com/browse/division
What I'm trying to do is pull down all the business site pages found on the site. These pages have the following format:
http://www.mysite.com/site/business-site-1
So... I am able to pull down all the pages with the following rules:
<parameter key="follow_link_with_matching_url" value=".*browse.*"/>
<parameter key="follow_link_with_matching_url" value=".*division.*"/>
<parameter key="follow_link_with_matching_url" value=".*browse/division.*"/>
<parameter key="follow_link_with_matching_url" value=".*site.*"/>
<parameter key="store_with_matching_url" value=".*site.*"/>
But the problem is that this casts too broad a net. I'm picking up links which have the following format: http://www.mysite.com/es/site/business-site-1. They're in Spanish, so I don't want 'em. I don't know how to exclude them. My latest attempt is the following:
<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/division.*"/>
<parameter key="follow_link_with_matching_url" value="/browse/division.*"/>
<parameter key="follow_link_with_matching_url" value="/site/.*"/>
<parameter key="store_with_matching_url" value="/site/.*"/>
But this doesn't work. The actual links in the source are relative: /site/business-site-1. Is the RapidMiner crawler resolving these links to absolute form? I've also tried spelling out the absolute paths in the rules, like so:
<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/site/.*"/>
But this isn't working either. Is there something going on here with the order of the rules themselves? Are the rules OR'ed? I'm struggling a little here, and the regular expressions seem to work fine outside the Web Crawler context.
Answers
Hi,
On Rapid-I.com the process below works perfectly. Maybe you have to include the absolute URL in the store rule as well?
Best regards,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<process expanded="true" height="480" width="779">
<operator activated="true" class="web:crawl_web" compatibility="5.2.004" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
<parameter key="url" value="http://rapid-i.com"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="http://rapid-i.com/content/view/.*/1/lang,en/"/>
<parameter key="store_with_matching_url" value="http://rapid-i.com/content/view/.*/1/lang,en/"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="max_pages" value="10"/>
<parameter key="max_depth" value="5"/>
<parameter key="delay" value="100"/>
<parameter key="really_ignore_exclusion" value="true"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
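For anyone who wants to check rules like these outside the crawler, here is a minimal Python sketch. It assumes (this is not confirmed anywhere in the thread) that the crawler resolves links to absolute URLs and requires a rule to match the whole URL, which is how Java's `String.matches` behaves. The two sample URLs and the helper `rule_matches` are hypothetical; the patterns are the ones tried in the question.

```python
import re

# Hypothetical sample URLs based on the formats quoted in the question.
ENGLISH = "http://www.mysite.com/site/business-site-1"
SPANISH = "http://www.mysite.com/es/site/business-site-1"


def rule_matches(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the entire URL (assumed crawler behaviour)."""
    return re.fullmatch(pattern, url) is not None


# Patterns taken from the rules tried in the question.
for pattern in (".*site.*", "/site/.*", "http://www.mysite.com/site/.*"):
    print(pattern, rule_matches(pattern, ENGLISH), rule_matches(pattern, SPANISH))
```

Under that assumption, `.*site.*` accepts both URLs (it even matches the "site" inside "mysite"), the relative pattern `/site/.*` accepts neither, and only the absolute pattern separates the English page from the Spanish one. That would be consistent with needing the absolute URL in the store rule too.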
Ok,
Finally figured this out. It looks like you can only have one rule of each type, although that isn't very clear from the interface. You can use the regex matching groups functionality to match alternative phrases in the URLs, which works well for my use case. I'm not even using the captured groups, but the grouping helps match a "word" in the URL. Here are my two (and only two) revised rules:
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="http://www.mysite.com/(browse/site|site).*"/>
<parameter key="store_with_matching_url" value="http://www.mysite.com/(browse/site|site).*"/>
</list>
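As a quick sanity check, the follow rule quoted above can be run against a few sample URLs in Python. This is only a sketch: it assumes the crawler needs the rule to match the complete absolute URL, and the sample URLs and the helper `kept` are made up for illustration, modelled on the formats mentioned in this thread.

```python
import re

# The follow rule quoted above.
RULE = r"http://www.mysite.com/(browse/site|site).*"

# Hypothetical URLs modelled on the formats mentioned in this thread.
SAMPLES = [
    "http://www.mysite.com/site/business-site-1",
    "http://www.mysite.com/es/site/business-site-1",
    "http://www.mysite.com/browse/division",
]


def kept(urls, rule=RULE):
    """Return the URLs that the rule matches in full (assumed semantics)."""
    compiled = re.compile(rule)
    return [u for u in urls if compiled.fullmatch(u)]


print(kept(SAMPLES))
```

With these samples only the English business page is kept; the `/es/` page is rejected because the path must start with `site` or `browse/site` right after the host. The `/browse/division` URL is rejected as well, so if listing pages under that path also need to be followed, the alternation would presumably need a branch for them.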