
"Using Regex in the web crawler"

User: "guitarslinger"
New Altair Community Member
Updated by Jocelyn
Hi there,

I am struggling with the setup of the crawlers in the web mining extension:

I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.

Can I find an example for crawling rules somewhere?

Thx in advance

GS

    User: "B_Miner"
    New Altair Community Member
    Post what you are trying to do (the XML) and a description. Maybe someone can help. I have used it successfully, but again I'm not sure what your aim is.
    User: "guitarslinger"
    New Altair Community Member
    OP
    Hi B_Miner, good point:

    Here is the XML. I just have the crawler connected to the main process, with two rules:
    1. follow every link ".*"
    2. store every page ".*"
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="673" width="1094">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
            <parameter key="url" value="http://www.aol.com"/>
            <list key="crawling_rules">
              <parameter key="3" value=".*"/>
              <parameter key="1" value=".*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
            <parameter key="max_depth" value="10"/>
            <parameter key="max_page_size" value="1000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
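
    As an aside, the two `.*` rules in the XML above follow and store every page. For reference, a more restrictive rule set might look like the sketch below; the rule keys ("1" = follow matching links, "3" = store matching pages) are assumed to mean the same thing as in the XML above, and the domain and patterns are hypothetical:

    ```xml
    <!-- Hypothetical, more restrictive crawling rules:
         only follow links within one domain, and only
         store pages whose URL ends in .html. -->
    <list key="crawling_rules">
      <parameter key="1" value=".*example\.com.*"/>
      <parameter key="3" value=".*\.html"/>
    </list>
    ```

    Note that the patterns are regular expressions matched against the full URL, so literal dots in a domain should be escaped as `\.`.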

    User: "guitarslinger"
    New Altair Community Member
    OP
    Problem solved: I had no value in the parameter "max. pages".

    I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but in fact without any value the operator does not crawl at all.

    Works now, I am happy!

    Regards GS
    ;D
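
    For anyone hitting the same issue: the fix amounts to giving the Crawl Web operator an explicit page limit. A sketch of the added parameter is below; the key name `max_pages` is an assumption based on the parameter label "max. pages" mentioned above, so check the operator's actual parameter list in your version:

    ```xml
    <!-- Sketch of the fix: give the Crawl Web operator an explicit
         page limit. The key name "max_pages" is assumed from the
         parameter label "max. pages"; verify it in your version. -->
    <parameter key="max_pages" value="100"/>
    ```
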
    User: "land"
    New Altair Community Member
    Well,
    it should be optional. ****. I will make sure it's optional in the future :)
    Good thing you got it to work, though.

    Greetings,
      Sebastian