"Crawling rules"

Xannix
Hi,
I'm not sure whether I'm misunderstanding the method, but I don't know how to use the "store_with_matching_content" parameter.

I would like to store pages which contain one specific word (for example "euro"). I've tried writing:

a) Just the word: euro
b) A regular expression, for example: .*euro.*

What is the problem? Could someone explain this to me?

Thanks : )

Answers

  • land
    Hi,
    you have to enter a valid regular expression.
    Please post the process, so that I can take a look at your parameters.

    Greetings,
      Sebastian
  • colo
    I tried to use this rule some days ago without success. The other rules seem to work as expected, but there might be a an issue with matching the regular expression for store_with_matching_content. I entered several expressions and even .* didn't bring up any results. Does this problem derive from usage or from a little bug? ;)
  • Xannix
    Hi colo,
    I have the same problem: all the other rules work fine, but not this one. Here is my example, crawling the Rapid-I website:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="145" width="212">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
            <parameter key="url" value="http://rapid-i.com/index.php?lang=en"/>
            <list key="crawling_rules">
              <parameter key="2" value="http://rapid-i\.com/.*"/>
              <parameter key="1" value=".*Rapid.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="max_pages" value="2"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • land
    Hi,
    what exactly happens with this rule? Does the operator always return an empty set, or does it not finish at all?

    Greetings,
      Sebastian
  • colo
    Hello Sebastian,

    it doesn't even result in an empty set. There simply are no results: after the process finishes, the prompt for switching to the results perspective shows up as usual, but there is only the empty result overview and nothing else...

    Regards,
    Matthias
  • haddock
    Greets to all,

    Well, it is actually possible to get something from the web crawler - the code below makes word vectors of the recent posts in this forum - but if you want to mine more than a few pages, I'm not sure the WebSPHINX library is that robust; the last version was released in 2002. Furthermore, if I insert print statements in appropriate places and build the operators from scratch, I can see results that are, shall we say, intriguing. Anyway, here's the creepy crawler...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="-20" width="-50">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="53" y="53">
            <parameter key="url" value="http://rapid-i.com/rapidforum/index.php?action=recent"/>
            <list key="crawling_rules">
              <parameter key="0" value="http://rapid-i.com/rapidforum.*"/>
              <parameter key="2" value="http://rapid-i.com/rapidforum.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Administrator\My Documents\WebCrawler"/>
            <parameter key="max_pages" value="10"/>
            <parameter key="max_depth" value="3"/>
            <parameter key="max_threads" value="12"/>
            <parameter key="user_agent" value="haddock checking rapid-miner-crawler"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="360" y="46">
            <list key="specify_weights"/>
            <process expanded="true" height="353" width="808">
              <operator activated="true" class="web:unescape_html" expanded="true" height="60" name="Unescape Content" width="90" x="187" y="28"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="400" y="26"/>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="543" y="26"/>
              <connect from_port="document" to_op="Unescape Content" to_port="document"/>
              <connect from_op="Unescape Content" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    On the other hand, if I use a RetrievePagesOperator on the output from an RSS Feed operator, all works fine.


    Toodles


  • land
    Hi,
    I switched the regular expression to DOTALL mode, so that . also matches line breaks. This solves the issue that the regular expression doesn't match the document, but applying such an expression to a 120 KB web page takes far too long. I think we will have to bury this option in its current incarnation.
    Any idea how to replace it, besides simply switching to plain string matching?

    Greetings,
      Sebastian

    PS:
    If anybody knows another powerful open-source web crawler that's usable from Java, I would gladly replace that "creepy" sphinx.
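
    To illustrate the trade-off (a plain Java sketch, not the operator's actual code): Pattern.DOTALL lets . span line breaks, while a simple substring test avoids regex backtracking entirely and stays cheap even on large pages.

    import java.util.regex.Pattern;

    public class DotallVsContains {
        public static void main(String[] args) {
            String page = "<html>\n<body>\nPrices in euro\n</body>\n</html>";

            // DOTALL (or inline (?s)): '.' also matches line breaks.
            Pattern p = Pattern.compile(".*euro.*", Pattern.DOTALL);
            System.out.println(p.matcher(page).matches());  // true

            // Plain string matching: no backtracking over the whole
            // page, so it is far cheaper for a fixed word.
            System.out.println(page.contains("euro"));      // true
        }
    }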

  • haddock
    Greets Seb,

    I'm cannibalising the sphinx at the moment, working on tokens rather than strings, as well as using the header fields (description, keywords, etc.), which are regex-friendly and can be pre-fetched. I've also started looking at Heritrix. Something may emerge  ;)

    Ciao
  • land
    Hi,
    thanks for the hint about Heritrix. This really seems worth the effort. Uhm, now I only need somebody to pay me for implementing it. Any volunteers? :)
    Does anybody have negative experiences with this crawler? Otherwise I will add it to the feature request list.

    Greetings,
      Sebastian
  • Xannix
    So... isn't it possible to crawl with the "store_with_matching_content" parameter at all?

    For the moment, I do it this way:

    [1] Crawl Web ->
    [2] Generate Extract ->
    [3] Filter Examples

    [1]: I don't use "store_with_matching_content"
    [2]: I extract the text with XPath, because the "attribute_value_filter" parameter of the "Filter Examples" operator doesn't work if the content contains any HTML tag. Is that normal or not?
    [3]: I keep only the examples whose content matches

    I know this works, but I don't think it is efficient (a rough sketch of the idea follows below)...

    Any idea?

    Thanks : ))
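
    The same filtering idea outside RapidMiner, as a rough Java sketch; the URLs, page contents, and the crude tag-stripping regex are illustrative only:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ContentFilterSketch {
        public static void main(String[] args) {
            // Hypothetical crawl result: URL -> raw HTML.
            Map<String, String> pages = Map.of(
                "http://example.com/a", "<html><body>Price: 10 <b>euro</b></body></html>",
                "http://example.com/b", "<html><body>nothing here</body></html>");

            // Strip tags (standing in for the XPath extraction step),
            // then keep only pages whose text contains the word.
            List<String> hits = pages.entrySet().stream()
                .filter(e -> e.getValue().replaceAll("<[^>]*>", " ").contains("euro"))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

            System.out.println(hits);  // [http://example.com/a]
        }
    }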
  • land
    Hi,
    this depends on the regular expression used, but I guess you will have to switch to DOTALL mode, because normally there is a line break after the tag, and by default the . character does not match line breaks.

    Greetings,
      Sebastian
  • Xannix
    Hi,
    where can I find the "dotall mode" option?

    Thanks
  • Xannix
    Sorry, I realized I was wrong...

    I've been testing again: if you want to find the word "Euro" in the content, you can write:

    [\S\s]*Euro[\S\s]*

    It may be a little slow, but it works.

    Thanks for all : )
  • colo
    Hello Xannix,

    if you want to use options/modifiers in your expressions, you can simply enable them with (?x) at the start of your regex; the "x" specifies which option to use, and for the DOTALL option this would be "s". I think it's an easy and clean way to set all options at the beginning of your regex. For your "Euro" search it would read as follows:

    (?s).*Euro.*
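
    For reference, this is how the inline flags behave in plain Java ((?i) can be combined with (?s) as (?is)), and the [\S\s] character class from the previous post is an equivalent, flag-free workaround:

    public class FlagDemo {
        public static void main(String[] args) {
            String page = "<p>\nPrices are shown in EURO.\n</p>";

            System.out.println(page.matches("(?s).*Euro.*"));   // false: case differs
            System.out.println(page.matches("(?is).*Euro.*"));  // true: DOTALL + ignore case
            // [\S\s] matches any character, including '\n', without flags:
            System.out.println(page.matches("[\\S\\s]*EURO[\\S\\s]*"));  // true
        }
    }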
  • Xannix
    Hi colo, thanks, I'll try it : )