[SOLVED] Web Mining: Crawl Web works or not - depending on site, bug or feature?

Are there known bugs in the Web Mining: Crawl Web operator? I have noticed several forum threads on the web asking the same question, but with no answers.

I tested with RapidMiner version 5.3.013 and the latest Web Mining package. The two processes below use the same logic on two different sites, yet one works and the other does not.

1. This works:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://uta.fi"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*tutkimus.*"/>
          <parameter key="follow_link_with_matching_url" value=".*tutkimus.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

2. But this does not, although the logic is the same:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
          <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I wonder why. Is there any way to see more detail, step by step, about what the operator is doing when it parses the page, so that you could find the reason yourself?

Is RapidMiner's "Crawl Web" generally reliable, or should I use other software to crawl fairly large forum sites and use RapidMiner only for mining the crawled files?

pjworsfold

I had a similar issue, where RM would crawl some sites but not others. I bumped up the max page size to 1000 KB and now it works very well.

number6

Pjdoubleyou, thank you very much, it helped! The source URL page was indeed over 100 KB, although the fetched pages were smaller.
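
For anyone finding this thread later, here is a rough sketch of how the fix might look in the second process. It is only an illustration: the parameter key `max_page_size` and its unit (KB) are assumptions based on the "max page size" setting mentioned in the reply, so check the parameter name in your own Crawl Web operator before relying on it. The other parameters are copied from the failing process in the question, with the remaining ones left out for brevity.

```xml
<!-- Sketch only: the failing Crawl Web operator with a larger page size limit. -->
<operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" name="Crawl Web">
  <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
  <list key="crawling_rules">
    <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
    <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
  </list>
  <parameter key="max_pages" value="100"/>
  <!-- Assumed key name and unit (KB). The start page was over 100 KB, -->
  <!-- so a higher limit such as 1000 should let the crawl begin. -->
  <parameter key="max_page_size" value="1000"/>
</operator>
```

If the start page exceeds the size limit, the crawler apparently cannot collect any links from it, which would explain why the crawl failed even though the pages that were eventually fetched turned out to be smaller.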