
"Using Regex in the web crawler"

User: "guitarslinger"
New Altair Community Member
Updated by Jocelyn
Hi there,

I am struggling with the setup of the crawlers in the web mining extension:

I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.

Can I find an example for crawling rules somewhere?

Thx in advance

GS

    User: "B_Miner"
    New Altair Community Member
    Post what you are trying to do (the XML) and a description. Maybe someone can help. I have used it successfully, but again I'm not sure what your aim is.
    User: "guitarslinger"
    New Altair Community Member
    OP
    Hi B_Miner, good point:

    Here is the XML. I just have the crawler connected to the main process, with two rules:
    1. follow every link ".*"
    2. store every page ".*"
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="673" width="1094">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
            <parameter key="url" value="http://www.aol.com"/>
            <list key="crawling_rules">
              <parameter key="3" value=".*"/>
              <parameter key="1" value=".*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
            <parameter key="max_depth" value="10"/>
            <parameter key="max_page_size" value="1000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
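
    As an aside, the two `.*` rules in the XML above follow and store every page. For reference, a more restrictive rule set might look like the sketch below; the rule keys ("1" = follow matching links, "3" = store matching pages) are assumed to mean the same thing as in the XML above, and the domain and patterns are hypothetical:

    ```xml
    <!-- Hypothetical, more restrictive crawling rules:
         only follow links within one domain, and only
         store pages whose URL ends in .html. -->
    <list key="crawling_rules">
      <parameter key="1" value=".*example\.com.*"/>
      <parameter key="3" value=".*\.html"/>
    </list>
    ```

    Note that the patterns are regular expressions matched against the full URL, so literal dots in a domain should be escaped as `\.`.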

    User: "guitarslinger"
    New Altair Community Member
    OP
    Problem solved: I had no value in the parameter "max. pages".

    I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but in fact without any value the operator does not crawl at all.

    Works now, I am happy!

    Regards GS
    ;D
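
    For anyone hitting the same issue: the fix amounts to giving the Crawl Web operator an explicit page limit. A sketch of the added parameter is below; the key name `max_pages` is an assumption based on the parameter label "max. pages" mentioned above, so check the operator's actual parameter list in your version:

    ```xml
    <!-- Sketch of the fix: give the Crawl Web operator an explicit
         page limit. The key name "max_pages" is assumed from the
         parameter label "max. pages"; verify it in your version. -->
    <parameter key="max_pages" value="100"/>
    ```
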
    User: "land"
    New Altair Community Member
    Well,
    it should be optional. ****. I will make sure it's optional in the future :)
    Good thing you got it to work, though.

    Greetings,
      Sebastian