"Problem with RapidMiner Crawler"

lexusboy
Hello,

I recently started using RapidMiner to crawl web sites, but I have been facing problems with some sites. I really like RapidMiner's performance, including the ease with which it can be configured to suit your needs, and I want to stick with it, so any help would be appreciated.

Here is a snapshot from my log file

P May 20, 2009 3:29:59 PM: Initialising process setup
P May 20, 2009 3:29:59 PM: [NOTE] No filename given for result file, using stdout for logging results!
P May 20, 2009 3:29:59 PM: Checking properties...
P May 20, 2009 3:29:59 PM: Properties are ok.
P May 20, 2009 3:29:59 PM: Checking process setup...
P May 20, 2009 3:29:59 PM: Inner operators are ok.
P May 20, 2009 3:29:59 PM: Checking i/o classes...
P May 20, 2009 3:29:59 PM: i/o classes are ok. Process output: ExampleSet, NumericalMatrix.
P May 20, 2009 3:29:59 PM: Process ok.
P May 20, 2009 3:29:59 PM: Process initialised
P May 20, 2009 3:29:59 PM: [NOTE] Process starts
P May 20, 2009 3:29:59 PM: Process:
  Root[0] (Process)
  +- Crawler[0] (Crawler)
G May 20, 2009 3:29:59 PM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of Crawler (Crawler)
G May 20, 2009 3:29:59 PM: [Fatal] Process failed: operator cannot be executed (0). Check the log messages...
          Root[1] (Process)
here ==> +- Crawler[1] (Crawler)

Answers

  • land
    Hi,
    unfortunately I cannot see anything from this log besides the fact that there is an error :) If you could post the process containing the crawler here, I would be able to reproduce the error and could then try to resolve it.

    Greetings,
      Sebastian
  • lexusboy
    Hello Sebastian,

    Here is my process as an XML structure; I hope this is what you meant :)

    <operator name="Root" class="Process" expanded="yes">
        <operator name="Crawler" class="Crawler">
            <parameter key="url" value="http://www.triathlon-szene.de/forum/"/>
            <list key="crawling_rules">
              <parameter key="visit_content" value="&quot;new balance&quot;"/>
              <parameter key="follow_url" value="laufforum.de"/>
            </list>
            <parameter key="delay" value="0"/>
            <parameter key="max_threads" value="3"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Bhavya\My Documents\RapidMiner\laufforum"/>
        </operator>
    </operator>

    Thanks in advance!

    Regards,
    Bhavya
  • land
    Hi,
    thank you. I will take a look at it as soon as possible. But the error seems to be in the WVTool, which makes debugging a lot more complex :)

    Greetings,
      Sebastian
  • lexusboy
    Hello,

    Thank you Sebastian, I hope you can find a solution for this problem :)

    Best Regards,
    Bhavya
  • miwahattori
    Hello all,

    I'm wondering if there was any resolution to this problem.  I'm using v4.6 of the plug-in and getting the exact same error.  My process is essentially the same as Bhavya's in this thread, so rather than start a new post I'm looking for any follow-up here.  The error occurs only with some starting URLs.

    Any guidance will be appreciated!
    Miwa
  • land
    Hi,
    sorry for the late answer, but I guess that's not a bug in the RapidMiner crawler. Instead, those forums simply forbid robots from crawling them in their robots.txt. The crawler obeys this rule as long as obey_robot_exclusion is checked, and the setting should not be changed as long as the website owner does not allow you to crawl the site.
    Another possibility is that the forum only allows user agents that identify themselves as browsers.
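
    For anyone who wants to verify this quickly, here is a small sketch using Python 3's standard library (just an illustration, not part of RapidMiner; the URL is the one from this thread):

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    start_url = "http://www.triathlon-szene.de/forum/"
    robots_url = urljoin(start_url, "/robots.txt")  # robots.txt sits at the site root

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetch and parse the robots.txt

    # "*" is a generic user agent, i.e. how a crawler that does not
    # identify itself as a browser would be treated
    print(rp.can_fetch("*", start_url))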

    Greetings,
      Sebastian
  • miwahattori
    Sebastian,

    Thanks for your response.  I had the obey_robot_exclusion rule unchecked because I was running a test retrieval against our own organization's homepage, and I was still getting the error.  However, we found a workaround: our intuition is that the homepage I was trying to crawl had too many links, overflowing an array index.  Our homepage housed a list of 1000+ links.  After splitting that list into smaller partitions, i.e. creating multiple HTML files of about 200 links each to use as starting URLs, the crawler ran without an error on each of them.  If you could confirm whether an excessive number of links is known to cause an issue in the crawler, I would very much appreciate it, but in the meantime we are able to continue with this workaround.
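
    In case it helps anyone reproduce the workaround, here is a rough Python sketch of how such partitioned seed files could be generated (the input file name, output names, and the 200-link chunk size are illustrative choices, nothing the crawler requires):

    # Split a flat list of URLs (one per line in links.txt, hypothetical)
    # into small HTML seed files, each usable as a separate starting URL.
    CHUNK_SIZE = 200

    with open("links.txt") as f:
        links = [line.strip() for line in f if line.strip()]

    for i in range(0, len(links), CHUNK_SIZE):
        chunk = links[i:i + CHUNK_SIZE]
        body = "\n".join('<a href="{0}">{0}</a><br/>'.format(u) for u in chunk)
        with open("seed_{:03d}.html".format(i // CHUNK_SIZE), "w") as out:
            out.write("<html><body>\n" + body + "\n</body></html>\n")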

    Best regards,
    Miwa

  • land
    Hi,
    I have never heard of such problems. We will be revising the crawler shortly; I will take this hint into account then.

    Greetings,
      Sebastian