[SOLVED] Web Mining: Crawl Web works or not - depending on site, bug or feature?

number6 New Altair Community Member
edited November 5 in Community Q&A
Are there known bugs in the Web Mining: Crawl Web operator? I have noticed several forum threads on the web asking the same question, but none with answers.

I have now tested with RapidMiner version 5.3.013 and the latest Web Mining package. The two processes below crawl two different sites with the same logic, yet one works and the other does not.

1. This works:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
       <parameter key="url" value="http://uta.fi"/>
       <list key="crawling_rules">
         <parameter key="store_with_matching_url" value=".*tutkimus.*"/>
         <parameter key="follow_link_with_matching_url" value=".*tutkimus.*"/>
       </list>
       <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
       <parameter key="extension" value="html"/>
       <parameter key="max_pages" value="100"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
       <parameter key="obey_robot_exclusion" value="false"/>
       <parameter key="really_ignore_exclusion" value="true"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

2. But this does not, although the logic is exactly the same:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
       <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
       <list key="crawling_rules">
         <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
         <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
       </list>
       <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
       <parameter key="extension" value="html"/>
       <parameter key="max_pages" value="100"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
       <parameter key="obey_robot_exclusion" value="false"/>
       <parameter key="really_ignore_exclusion" value="true"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
I wonder why? Is there any way to see more detail, step by step, about what the operator is doing when it parses the page, so that you could perhaps find the reason yourself?

Is the RapidMiner "Crawl Web" operator generally reliable, or should I rather use some other software for crawling fairly big forum sites and then use RapidMiner only for mining the crawled files?

Answers

  • I had a similar issue, where RM would crawl some sites but not others. I bumped up the max page size to 1000 KB and now it works very well.

  • number6 New Altair Community Member
    Pjdoubleyou, thank you very much, it helped! The source URL page was indeed over 100 KB, although the fetched pages were smaller.
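
For anyone hitting the same problem, the fix above could look roughly like this in the process XML of the second (failing) example. This is only a sketch: the key name `max_page_size` and the assumption that its value is given in KB are inferred from the "max page size" setting mentioned in the answer, so check the operator's parameter panel for the exact name in your version. The other parameters from the original process are omitted here for brevity.

```xml
<operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
  <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
  <list key="crawling_rules">
    <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
    <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
  </list>
  <parameter key="max_pages" value="100"/>
  <!-- Assumed key name and unit (KB): raised so the start page, which was over 100 KB, is not skipped -->
  <parameter key="max_page_size" value="1000"/>
</operator>
```

If the start page itself exceeds the size limit, no links are extracted from it, which would explain why the crawl returns nothing even though the individual forum pages are small.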