Not following links
Dazzerman
New Altair Community Member
Hi,
I have adapted a Crawl Web process that worked elsewhere, but in my latest example it is not following links. All I have changed is the starting URL and the crawling rules.
Does anyone know why this might be happening?
Thanks.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="165">
        <parameter key="url" value="http://www.heatingspareparts.com/index.asp"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+\&amp;gc=.+"/>
          <parameter key="follow_link_with_matching_url" value=".+suppliername.+|.+\&amp;gc=.+"/>
        </list>
        <parameter key="output_dir" value="C:\scratch\RapidMiner\Gas Council"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="3"/>
        <parameter key="max_page_size" value="10000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; MDDRJS)"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
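One way to sanity-check the two crawling rules outside RapidMiner is to run them against a few candidate URLs. The snippet below is only a rough sketch in Python, not what Crawl Web does internally. It assumes each rule has to match the whole URL (hence `re.fullmatch`) and that matching is case-sensitive. The sample URLs and the `check_url` helper are made up for illustration and were not taken from the site.

```python
import re

# The two crawling rules from the process above, with a literal ampersand.
STORE_RULE = r".+\&gc=.+"
FOLLOW_RULE = r".+suppliername.+|.+\&gc=.+"


def check_url(url: str) -> dict:
    """Report whether a URL would be followed and/or stored.

    Assumption: a rule must match the entire URL, so re.fullmatch
    is used here rather than re.search.
    """
    return {
        "follow": re.fullmatch(FOLLOW_RULE, url) is not None,
        "store": re.fullmatch(STORE_RULE, url) is not None,
    }


# Hypothetical URLs for illustration only.
SAMPLE_URLS = [
    "http://www.heatingspareparts.com/index.asp",
    "http://www.heatingspareparts.com/list.asp?suppliername=Example",
    "http://www.heatingspareparts.com/list.asp?SupplierName=Example",
    "http://www.heatingspareparts.com/part.asp?id=1&gc=12345",
    "http://www.heatingspareparts.com/part.asp?id=1&amp;gc=12345",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url, check_url(url))
```

Under these assumptions, a link that spells the parameter as `SupplierName`, or that carries the ampersand as `&amp;`, would match neither rule. Comparing the output with the actual links in the page source may help narrow down whether the rules or something else is at fault.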
Answers
I have adapted this process to work correctly on yet another website, but still do not understand why it is not working for the details posted here.
Does anyone know what might be preventing this process from producing results?
Thanks.