[SOLVED] RM5 does not store the pages according to the specified rules...

User: "leoderja"
Updated by Jocelyn
I am trying to crawl an online newspaper. I specified rules for navigating through the previous editions, and I need to store only the individual news articles (store_with_matching_url = .+deportes/8.+), not the index pages where they are listed. This is my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <process expanded="true" height="-20" width="-50">
     <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="84" y="53">
       <parameter key="url" value="http://www.pagina12.com.ar"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_url" value=".+principal/index.+|.+deportes/index.+|.+deportes/8.+"/>
         <parameter key="store_with_matching_url" value=".+deportes/8.+"/>
       </list>
       <parameter key="output_dir" value="C:\Users\USR\Desktop\FILES"/>
       <parameter key="extension" value="html"/>
       <parameter key="max_depth" value="9999999"/>
       <parameter key="domain" value="server"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
But this does not work: the links are followed, yet nothing is stored. Please see the log:
Apr 1, 2012 11:36:36 PM INFO: Process //NewLocalRepository/Pruebas/Crawler starts
Apr 1, 2012 11:36:36 PM INFO: Loading initial data.
Apr 1, 2012 11:36:37 PM INFO: Discarded page "http://www.pagina12.com.ar" because url does not match filter rules.
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190902-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190872-2012-04-01.html
Apr 1, 2012 11:36:37 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190897-2012-04-01.html
Apr 1, 2012 11:36:39 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html" because url does not match filter rules.
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190801-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190811-2012-03-31.html
Apr 1, 2012 11:36:39 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190840-2012-03-31.html
Apr 1, 2012 11:36:42 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-30.html" because url does not match filter rules.
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190725-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190718-2012-03-30.html
Apr 1, 2012 11:36:42 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190743-2012-03-30.html
Apr 1, 2012 11:36:44 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-29.html" because url does not match filter rules.
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/index-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190641-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190635-2012-03-29.html
Apr 1, 2012 11:36:44 PM INFO: Following link http://www.pagina12.com.ar/diario/deportes/8-190650-2012-03-29.html
Apr 1, 2012 11:36:47 PM INFO: Discarded page "http://www.pagina12.com.ar/diario/principal/index-2012-03-28.html" because url does not match filter rules.
(the log continues in the same pattern...)
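To rule out the regular expressions themselves, the two rules can be checked outside RapidMiner against a few URLs from the log. This is only a minimal sketch in Python: the helper name `classify` is made up for illustration, and it assumes the crawler requires a rule to match the whole URL (`re.fullmatch`), which may not be exactly how the Crawl Web operator applies its rules.

```python
import re

# Patterns copied from the crawling_rules in the process above.
FOLLOW_RULE = r".+principal/index.+|.+deportes/index.+|.+deportes/8.+"
STORE_RULE = r".+deportes/8.+"

# A few URLs taken from the log.
urls = [
    "http://www.pagina12.com.ar",
    "http://www.pagina12.com.ar/diario/principal/index-2012-03-31.html",
    "http://www.pagina12.com.ar/diario/deportes/index.html",
    "http://www.pagina12.com.ar/diario/deportes/8-190886-2012-04-01.html",
]


def classify(url):
    """Return (follow, store) flags for a URL.

    Assumption: a rule applies only if it matches the entire URL.
    """
    follow = re.fullmatch(FOLLOW_RULE, url) is not None
    store = re.fullmatch(STORE_RULE, url) is not None
    return follow, store


for url in urls:
    follow, store = classify(url)
    print(f"follow={follow!s:5} store={store!s:5} {url}")
```

Under that assumption the article URLs (deportes/8-...) satisfy both rules and the index pages satisfy only the follow rule, so the patterns look like what I intended.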
What could be wrong? Is there a bug in RM5's web crawler, or am I doing something wrong?

Thank you in advance.
Leonardo Der Jachadurian Gorojans
