"Using Regex in the web crawler"
guitarslinger
New Altair Community Member
Hi there,
I am struggling with the setup of the crawlers in the web mining extension:
I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.
Can I find an example for crawling rules somewhere?
Thx in advance
GS
Answers
-
Post what you are trying to do (the XML) along with a description; maybe someone can help. I have used it successfully, but I'm not sure what your aim is.
-
Hi B_Miner, good point:
Here is the XML, with just the crawler connected to the main process and two rules:
1. follow every link: ".*"
2. store every page: ".*"
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="673" width="1094">
<operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
<parameter key="url" value="http://www.aol.com"/>
<list key="crawling_rules">
<parameter key="3" value=".*"/>
<parameter key="1" value=".*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
<parameter key="max_depth" value="10"/>
<parameter key="max_page_size" value="1000"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
-
Problem solved: I had no value in the parameter "max. pages".
I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but without any value the operator does not crawl at all.
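For anyone hitting the same problem, the only change needed is giving the page limit an explicit value on the Crawl Web operator. The fragment below is a sketch based on my process above; I'm assuming the underlying parameter key is "max_pages" (matching the "max. pages" label in the GUI), and the limit of 100 is just an arbitrary example:

```xml
<operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
    <parameter key="url" value="http://www.aol.com"/>
    <list key="crawling_rules">
        <parameter key="3" value=".*"/>
        <parameter key="1" value=".*"/>
    </list>
    <parameter key="write_pages_into_files" value="false"/>
    <parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
    <!-- the missing piece: without an explicit value here the crawler returns no results -->
    <parameter key="max_pages" value="100"/>
    <parameter key="max_depth" value="10"/>
    <parameter key="max_page_size" value="1000"/>
</operator>
```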
Works now, I am happy!
Regards GS
;D
-
Well,
it should be optional. I will make sure it is optional in a future release.
Good thing you got it to work, though.
Greetings,
Sebastian