"Web Crawler problem"
mmarag
New Altair Community Member
Hi all,
I am facing a serious bug when using the web crawler or the Process Documents from Web operator. I am attempting to run a simple opinion mining experiment on the http://www.opengov.gr/ web site, which, according to its robots.txt file, allows every agent freely.
However, nothing happens, and there is nothing in my log either. For your information, I did not use any crawling rules.
Kind regards
mmarag
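The robots.txt claim above can be checked offline with Python's standard `urllib.robotparser`. This is only a minimal sketch: the `SAMPLE_ROBOTS_TXT` body and the `MyCrawler` agent name are hypothetical and were not fetched from http://www.opengov.gr/, so paste in the real file contents before drawing conclusions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body that allows every agent.
# It is NOT copied from http://www.opengov.gr/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "MyCrawler" is a placeholder user agent name.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", "http://www.opengov.gr/home/"))
```

If this reports that fetching is allowed and the crawl still returns nothing, robots.txt is probably not the cause.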
Answers
Hi there Mmarag,
For future reference, pasting the XML of your process makes it easier to check. For now, the following process appears to work, so I wonder where the "serious bug" really lies.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true" height="454" width="812">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="111" y="242">
        <parameter key="url" value="http://www.opengov.gr/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".*gr.*"/>
          <parameter key="store_with_matching_url" value=".*gr.*"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Documents and Settings\Administrator.KNOWLEDG-P6715Y\My Documents"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="62" y="117">
        <parameter key="url" value="http://www.opengov.gr/home/"/>
        <list key="query_parameters"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <connect from_op="Get Page" from_port="output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
</process>
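The two crawling rules in the process above, `follow_link_with_matching_url` and `store_with_matching_url`, both use the regular expression `.*gr.*`. The sketch below shows in Python how such a pattern filters URLs. It assumes the rule must match the whole URL, which is a guess about the operator's behaviour and not something confirmed in this thread; the helper names and the `example.com` URL are purely illustrative.

```python
import re

# Pattern taken from the two crawling rules in the process above.
RULE_PATTERN = ".*gr.*"


def matches_rule(url: str, pattern: str = RULE_PATTERN) -> bool:
    """Return True if the whole URL matches the rule's regular expression.

    Assumption: the rule is applied as a full match against the URL.
    """
    return re.fullmatch(pattern, url) is not None


def filter_urls(urls, pattern: str = RULE_PATTERN):
    """Keep only the URLs that the rule would accept."""
    return [url for url in urls if matches_rule(url, pattern)]


if __name__ == "__main__":
    candidates = [
        "http://www.opengov.gr/",
        "http://www.opengov.gr/home/",
        "http://example.com/",  # illustrative URL without "gr" in it
    ]
    for url in filter_urls(candidates):
        print(url)
```

Under that assumption, any URL containing `gr` is both followed and stored, while a URL without it is skipped.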
Dear Sir,
thank you very much for the rapid response.
Mmarag