"Web Crawler problem"

New Altair Community Member

Apr 14, 2011

Updated Nov 5, 2024 by Jocelyn

Hi all,

I am facing a serious bug when using the Web Crawler or the Process Documents from Web operators. I am attempting to run a simple opinion mining experiment on the http://www.opengov.gr/ web site, which according to its robots.txt file allows every agent freely.

However, nothing happens, and there is nothing in my log either. For your information, I did not define any crawling rules. Kind regards

mmarag
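The robots.txt claim above can be checked offline. Below is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `allows_all_agents`, the agent names, and the sample robots.txt text are assumptions made for illustration; the sample is not a copy of the file served by opengov.gr.

```python
from urllib.robotparser import RobotFileParser


def allows_all_agents(robots_txt: str, url: str,
                      agents=("*", "Googlebot", "MyCrawler")) -> bool:
    """Return True if every listed user agent may fetch `url`.

    `robots_txt` is the raw text of a robots.txt file. The agent names
    are hypothetical examples, not agents named in the thread.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, url) for agent in agents)


# Hypothetical robots.txt contents: an empty Disallow permits everything.
PERMISSIVE = "User-agent: *\nDisallow:\n"
# Hypothetical counter-example: "Disallow: /" blocks the whole site.
RESTRICTIVE = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    print(allows_all_agents(PERMISSIVE, "http://www.opengov.gr/"))
    print(allows_all_agents(RESTRICTIVE, "http://www.opengov.gr/"))
```

To test the live site, the real file would have to be downloaded and passed in as `robots_txt`. A permissive result only rules out robots.txt as the cause; it does not explain an empty crawl on its own.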

haddock

New Altair Community Member

Apr 14, 2011

Hi there Mmarag,

For future reference, pasting the XML of your process makes it much easier to check. For now, the following process appears to work, so I wonder where the "serious bug" really lies.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true" height="454" width="812">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="111" y="242">
        <parameter key="url" value="http://www.opengov.gr/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".*gr.*"/>
          <parameter key="store_with_matching_url" value=".*gr.*"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Documents and Settings\Administrator.KNOWLEDG-P6715Y\My Documents"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="62" y="117">
        <parameter key="url" value="http://www.opengov.gr/home/"/>
        <list key="query_parameters"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <connect from_op="Get Page" from_port="output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
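To see which URLs the `.*gr.*` crawling rules in the process above would accept, here is a small Python sketch. It assumes the rule values are regular expressions that must match the entire URL; that matching behaviour is an assumption, not something confirmed in this thread. The helper `matches_rule` and the sample URLs other than the opengov.gr ones are hypothetical.

```python
import re


def matches_rule(pattern: str, url: str) -> bool:
    """Return True if `pattern` matches the whole URL.

    Full-string matching is an assumption about how the crawler
    applies its rules.
    """
    return re.fullmatch(pattern, url) is not None


# The value used for both rules in the process above.
RULE = ".*gr.*"

SAMPLE_URLS = [
    "http://www.opengov.gr/",             # start URL from the process
    "http://www.opengov.gr/home/",        # URL used by the Get Page operator
    "http://photography.example.com/",    # hypothetical: contains "gr" by accident
    "http://www.example.com/",            # hypothetical: no "gr" anywhere
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url, matches_rule(RULE, url))
```

Under that assumption the rule accepts any URL that contains the letters "gr", which is broader than "pages on a .gr domain". A tighter pattern such as `https?://www\.opengov\.gr/.*` would be one way to keep a crawl on the site itself.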

mmarag

New Altair Community Member

Apr 14, 2011

Dear Sir,

Thank you very much for the rapid response.

Mmarag
