RM5 WebCrawler rules

leoderja New Altair Community Member
edited November 5 in Altair RapidMiner
Dear users & RM team:

I could not find enough documentation about the Crawl Web operator.

"The Word Vector Tool and the RapidMiner Text Plugin" says:
If several expressions are given for the same condition, they are treated as a disjunction.
This allows one to express DNF expressions for each individual condition.
Conditions of different types are combined by conjunction, i.e. all of them have to be fulfilled.
It seems that this does not work in RM5. See below a small process that shows the problem.

I had to use the "|" pipe inside the regex to specify a disjunction of patterns.
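To make the workaround concrete, here is a minimal sketch of the crawling_rules list with both patterns merged into one rule. Only this list changes; the rest of the process in this post stays the same. The exact combined pattern is written out here for illustration and is not copied from a saved process, and it assumes the rule value is matched as a regular expression against the whole URL, which is what the ".*" on both sides of my original patterns relies on.

```xml
<!-- Sketch only: one rule whose regex covers both patterns via alternation -->
<list key="crawling_rules">
  <!-- matches any URL containing "nomenclador" OR "pantallas" -->
  <parameter key="store_with_matching_url" value=".*(nomenclador|pantallas).*"/>
</list>
```

With a single rule there is nothing left for the rule order to affect, so this sidesteps the problem rather than explaining it.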

Question: How can I specify a negative rule (i.e. if this condition is true, do not follow this link)?

Thank you.
Best regards, Leonardo Der Jachadurian Gorojans




I tried to crawl a tiny website. I specified two rules for the same condition, ".*nomenclador.*" OR ".*pantallas.*". When I change the order of the rules, RM5 obeys only the first rule and ignores any further rules for the same condition.

Please see the process below and try changing the order of the rules:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <process expanded="true" height="-20" width="-50">
     <operator activated="true" class="web:crawl_web" compatibility="5.1.004" expanded="true" height="60" name="Crawl Web" width="90" x="84" y="53">
       <parameter key="url" value="http://www.bitios.com.ar"/>
       <list key="crawling_rules">
         <parameter key="store_with_matching_url" value=".*nomenclador.*"/>
         <parameter key="store_with_matching_url" value=".*pantallas.*"/>
       </list>
       <parameter key="write_pages_into_files" value="false"/>
       <parameter key="add_pages_as_attribute" value="true"/>
       <parameter key="output_dir" value="C:\Users\USR\Desktop\FILES"/>
       <parameter key="extension" value="html"/>
       <parameter key="domain" value="server"/>
       <parameter key="delay" value="0"/>
       <parameter key="max_page_size" value="999"/>
       <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
Here is the log with ".*nomenclador.*" as the first rule:
Apr 7, 2012 9:15:21 PM INFO: No filename given for result file, using stdout for logging results!
Apr 7, 2012 9:15:21 PM INFO: Process //NewLocalRepository/Pruebas/Crawl_Bitios starts
Apr 7, 2012 9:15:21 PM INFO: Loading initial data.
Apr 7, 2012 9:15:21 PM INFO: Discarded page "http://www.bitios.com.ar" because url does not match filter rules.
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/index.html
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/quienessomos.html
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/comocomprar.html
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/consultas.html
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/nomenclador.html
Apr 7, 2012 9:15:21 PM INFO: Following link http://www.bitios.com.ar/calculadora.html
Apr 7, 2012 9:15:22 PM INFO: Discarded page "http://www.bitios.com.ar/index.html" because url does not match filter rules.
Apr 7, 2012 9:15:22 PM INFO: Discarded page "http://www.bitios.com.ar/quienessomos.html" because url does not match filter rules.
Apr 7, 2012 9:15:22 PM INFO: Discarded page "http://www.bitios.com.ar/comocomprar.html" because url does not match filter rules.
Apr 7, 2012 9:15:22 PM INFO: Discarded page "http://www.bitios.com.ar/consultas.html" because url does not match filter rules.
Apr 7, 2012 9:15:22 PM INFO: Following link http://www.bitios.com.ar/Consultas.html
Apr 7, 2012 9:15:22 PM INFO: Discarded page "http://www.bitios.com.ar/Consultas.html" because url does not match filter rules.
Apr 7, 2012 9:15:22 PM INFO: Storing page http://www.bitios.com.ar/nomenclador.html
Apr 7, 2012 9:15:22 PM INFO: Following link http://www.bitios.com.ar/nomenclador_1.html
Apr 7, 2012 9:15:22 PM INFO: Following link http://www.bitios.com.ar/nomenclador_pantallas.html
Apr 7, 2012 9:15:22 PM INFO: Following link http://www.bitios.com.ar/nomenclador-traumatologia-palm/NomencladorTraumatologia_v.2.00_070205_1.exe
Apr 7, 2012 9:15:22 PM INFO: Following link http://www.bitios.com.ar/nomenclador-traumatologia-palm/Bitios - Nomenclador de Traumatología v. 2.00 - Manual del Usuario.pdf
Apr 7, 2012 9:15:22 PM INFO: Storing page http://www.bitios.com.ar/nomenclador_1.html
Apr 7, 2012 9:15:22 PM INFO: Storing page http://www.bitios.com.ar/nomenclador_pantallas.html
Apr 7, 2012 9:15:24 PM INFO: Storing page http://www.bitios.com.ar/nomenclador-traumatologia-palm/NomencladorTraumatologia_v.2.00_070205_1.exe
Apr 7, 2012 9:15:24 PM INFO: Discarded page "http://www.bitios.com.ar/calculadora.html" because url does not match filter rules.
Apr 7, 2012 9:15:24 PM INFO: Following link http://www.bitios.com.ar/calculadora_pantallas.html
Apr 7, 2012 9:15:24 PM INFO: Discarded page "http://www.bitios.com.ar/calculadora_pantallas.html" because url does not match filter rules.
Apr 7, 2012 9:15:24 PM INFO: Saving results.
Apr 7, 2012 9:15:24 PM INFO: Process //NewLocalRepository/Pruebas/Crawl_Bitios finished successfully after 2 s
And here is the log with ".*pantallas.*" as the first rule:
Apr 7, 2012 9:16:14 PM INFO: Saved process definition at '//NewLocalRepository/Pruebas/Crawl_Bitios'.
Apr 7, 2012 9:16:14 PM INFO: No filename given for result file, using stdout for logging results!
Apr 7, 2012 9:16:14 PM INFO: Process //NewLocalRepository/Pruebas/Crawl_Bitios starts
Apr 7, 2012 9:16:14 PM INFO: Loading initial data.
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/index.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/quienessomos.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/comocomprar.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/consultas.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/nomenclador.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/calculadora.html
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/index.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/quienessomos.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/comocomprar.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/consultas.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/Consultas.html
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/Consultas.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/nomenclador.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/nomenclador_1.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/nomenclador_pantallas.html
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/nomenclador-traumatologia-palm/NomencladorTraumatologia_v.2.00_070205_1.exe
Apr 7, 2012 9:16:15 PM INFO: Following link http://www.bitios.com.ar/nomenclador-traumatologia-palm/Bitios - Nomenclador de Traumatología v. 2.00 - Manual del Usuario.pdf
Apr 7, 2012 9:16:15 PM INFO: Discarded page "http://www.bitios.com.ar/nomenclador_1.html" because url does not match filter rules.
Apr 7, 2012 9:16:15 PM INFO: Storing page http://www.bitios.com.ar/nomenclador_pantallas.html
Apr 7, 2012 9:16:18 PM INFO: Discarded page "http://www.bitios.com.ar/nomenclador-traumatologia-palm/NomencladorTraumatologia_v.2.00_070205_1.exe" because url does not match filter rules.
Apr 7, 2012 9:16:18 PM INFO: Discarded page "http://www.bitios.com.ar/calculadora.html" because url does not match filter rules.
Apr 7, 2012 9:16:18 PM INFO: Following link http://www.bitios.com.ar/calculadora_pantallas.html
Apr 7, 2012 9:16:18 PM INFO: Storing page http://www.bitios.com.ar/calculadora_pantallas.html
Apr 7, 2012 9:16:18 PM INFO: Saving results.
Apr 7, 2012 9:16:18 PM INFO: Process //NewLocalRepository/Pruebas/Crawl_Bitios finished successfully after 3 s