"Web Crawler problem"

mmarag
mmarag New Altair Community Member
edited November 5 in Community Q&A
Hi all,

I am facing a serious bug when using the Crawl Web or Process Documents from Web operators. I am attempting to run a simple opinion mining experiment on the http://www.opengov.gr/ website, whose robots.txt file allows every agent without restriction.

However, nothing happens, and nothing appears in my log either. For your information, I did not define any crawling rules. Kind regards

mmarag

Answers

  • haddock
    haddock New Altair Community Member
    Hi there Mmarag,

    For the future, if you paste the XML of your process, it makes it easier to check. For the present, the following process appears to work, so I wonder where the "serious bug" really lies.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true" height="454" width="812">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="111" y="242">
            <parameter key="url" value="http://www.opengov.gr/"/>
            <!-- Crawling rules: follow and store every URL that contains "gr" -->
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value=".*gr.*"/>
              <parameter key="store_with_matching_url" value=".*gr.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Administrator.KNOWLEDG-P6715Y\My Documents"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <operator activated="true" class="web:get_webpage" compatibility="5.1.000" expanded="true" height="60" name="Get Page" width="90" x="62" y="117">
            <parameter key="url" value="http://www.opengov.gr/home/"/>
            <list key="query_parameters"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <connect from_op="Get Page" from_port="output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
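
    Note that the process above defines crawling rules, whereas the original attempt used none; that difference may be why nothing was stored, though I have not confirmed it. The pattern .*gr.* also matches any URL that merely contains "gr", so the crawl can wander onto unrelated sites. As a sketch only, assuming the rule values are regular expressions that must match the whole URL (which the .* wrappers above suggest), a narrower rule list could look like the fragment below. The opengov\.gr pattern is my own illustration and has not been tested against the site.

    <list key="crawling_rules">
      <!-- Assumption: restrict following and storing to URLs on opengov.gr -->
      <parameter key="follow_link_with_matching_url" value=".*opengov\.gr.*"/>
      <parameter key="store_with_matching_url" value=".*opengov\.gr.*"/>
    </list>

    This would replace the crawling_rules list inside the Crawl Web operator; every other parameter stays as it is.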
  • mmarag
    mmarag New Altair Community Member
    Dear Sir,

    thank you very much for the rapid response.

    Mmarag