Webcrawler Doubt

newbierapid
newbierapid New Altair Community Member
edited November 5 in Community Q&A
Hi All,

I am using RM 5.1 and am currently experimenting with web mining. My objective is to crawl a web page and display the results according to the crawling rules. After applying the crawling rules, I do not see any output.

I would appreciate any help. Thanks in advance.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="503" width="604">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="122" y="119">
        <parameter key="url" value="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&amp;Sect2=HITOFF&amp;u=/netahtml/PTO/search-adv.htm&amp;r=0&amp;p=1&amp;f=S&amp;l=50&amp;Query=apple&amp;d=PTXT"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*(Apple)"/>
          <parameter key="store_with_matching_content" value=".*(Apple"/>
          <parameter key="follow_link_with_matching_text" value=".*(Apple"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="max_pages" value="5"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Thanks

Answers

  • colo
    colo New Altair Community Member
    Hi,

    It seems the closing parentheses are missing in the last two rules.

    There is one special thing to consider when using "store_with_matching_content": if you want the dot to match all characters, including line breaks, you have to activate dot-all mode. You can do this by placing "(?s)" at the beginning of your expression. This will make crawling slow, however, since whole web pages have to be scanned (see http://rapid-i.com/rapidforum/index.php/topic,2102.0.html).
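
    Both points can be checked outside RapidMiner with plain java.util.regex. The following is only a minimal sketch: the class and method names (CrawlRuleCheck, compiles, matchesAll) and the sample page text are made up for illustration, and it assumes that a rule has to match the whole text, which may not be exactly how the Crawl Web operator applies it.

    ```java
    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;

    // Hypothetical helper, not part of RapidMiner.
    class CrawlRuleCheck {

        // Returns true if the expression is a syntactically valid Java regex.
        static boolean compiles(String regex) {
            try {
                Pattern.compile(regex);
                return true;
            } catch (PatternSyntaxException e) {
                return false; // e.g. an unclosed group such as ".*(Apple"
            }
        }

        // Assumption: the rule must match the entire text, not just a part of it.
        static boolean matchesAll(String regex, String text) {
            return Pattern.compile(regex).matcher(text).matches();
        }

        public static void main(String[] args) {
            // Made-up page content spanning several lines.
            String page = "<html>\n<body>Apple patents</body>\n</html>";

            System.out.println(compiles(".*(Apple"));                  // false: closing parenthesis missing
            System.out.println(compiles(".*(Apple)"));                 // true
            System.out.println(matchesAll(".*Apple.*", page));         // false: the dot stops at line breaks
            System.out.println(matchesAll("(?s).*Apple.*", page));     // true: dot-all mode enabled
        }
    }
    ```

    If a rule does not even compile here, it cannot work as a crawling rule either, so this is a quick way to test an expression before pasting it into the operator.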

    Regards
    Matthias
  • newbierapid
    newbierapid New Altair Community Member
    Hi Matthias,

    I have tried it the way you explained, but I still couldn't get it to work. Please find the XML code below.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="521" width="622">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="210">
            <parameter key="url" value="http://www.google.com/search?q=apple&amp;btnG=Search+Patents&amp;tbm=pts&amp;tbo=1&amp;hl=en"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_text" value="(?s).*apple.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Thanks
  • colo
    colo New Altair Community Member
    Hi,

    You're right, this is simply not working. I also can't obtain any pages for either of the URLs you tried (even without crawling rules, which means that all links should be followed). I tried a smaller web page instead, and that works. Maybe those big sites block the crawler somehow?
    Further investigation of the returned messages will certainly be required, which means working with the source code. Or maybe I am also missing something necessary to get this working... Sorry.

    Regards
    Matthias
  • newbierapid
    newbierapid New Altair Community Member
    Thanks Matthias,