"First Steps in Webmining"

So I decided to dig a bit deeper into RapidMiner and set myself a first challenge:
I want the crawler to fetch every post of a blog that mentions a certain word.

First I started with the wizard, but it seems to expect an existing database or file to work with.
So instead I took the "naked" Root process, added the Crawler operator,
and configured the rules like this:

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomlogfile.log"/>
    <parameter key="resultfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomresultfile.res"/>
    <operator name="Crawler" class="Crawler">
        <list key="crawling_rules">
            <parameter key="follow_url" value="spreeblick"/>
            <parameter key="visit_content" value="google"/>
        </list>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="output_dir" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\nsv"/>
        <parameter key="url" value="http://www.spreeblick.com/"/>
    </operator>
</operator>

So the crawler should go to spreeblick.com, follow only URLs that contain the string "spreeblick",
and save only those pages that contain the string "google" somewhere in their content.
Now, the funny thing is that it does start crawling, but ONLY if "obey_robot_exclusion" is
active. If I deactivate it, I get a "Process failed, RuntimeException caught: JOption Pane: parentComponent does
not have a valid parent." error.
Just to make sure so far: what am I doing wrong to get this strange robot exclusion error?
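
As a rough illustration of what the two crawling rules above are meant to do, here is a minimal Python sketch. It is not how the Crawler operator is implemented; the function names are made up for this example, and it assumes simple "contains" semantics, whereas the operator may treat the rule values as regular expressions that have to match the whole URL.

```python
import re

# Hypothetical stand-ins for the two rules in the process above.
# Assumption: a rule fires when the pattern occurs anywhere in the text.
FOLLOW_URL = re.compile(r"spreeblick")    # mirrors the follow_url rule
VISIT_CONTENT = re.compile(r"google")     # mirrors the visit_content rule


def should_follow(url: str) -> bool:
    """Return True if the link contains the follow_url pattern."""
    return FOLLOW_URL.search(url) is not None


def should_save(page_text: str) -> bool:
    """Return True if the page text contains the visit_content pattern."""
    return VISIT_CONTENT.search(page_text) is not None


# Example links, invented for illustration only.
links = [
    "http://www.spreeblick.com/2008/01/post",
    "http://example.org/other",
]
followed = [url for url in links if should_follow(url)]
```

With the sample links, only the spreeblick.com URL ends up in `followed`; a page would then be written to the output directory only if `should_save` is true for its content.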

IngoRM

Hi,

If I deactivate it, I get an "Process failed, RuntimeException caught: JOption Pane: parentComponent does not have a valid parent." error.

There was a bug in the crawling operator, which we have just fixed. Normally, a dialog should appear asking the user whether crawling without obeying "robots.txt" should really be performed, since this might not be legal or appropriate in all cases.
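
For readers unfamiliar with robot exclusion, the following Python sketch shows what "obeying robots.txt" amounts to, using the standard library's `urllib.robotparser`. It is independent of RapidMiner; the robots.txt content, the example.com URLs, and the "example-crawler" user agent are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str,
              user_agent: str = "example-crawler") -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
allowed = may_fetch(parser, "http://www.example.com/blog/post")
blocked = may_fetch(parser, "http://www.example.com/private/page")
```

A crawler that obeys the exclusion rules would skip every URL for which `may_fetch` returns False; switching the check off means fetching those pages anyway, which is why a confirmation is a sensible safeguard.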

You can get the fixed version via CVS (the bug was in the text plugin, formerly known as "wvtool", hence the module name) and the bugfix will of course also be part of the next release.

Cheers,
Ingo

296M

Compared to RapidMiner, HTTrack is a much more powerful and faster crawler.

That's why even textinput provides a tutorial for using HTTrack. ;D

Legacy User

Hi 296M, Hi All,

Do you know "webharvest" ( http://web-harvest.sourceforge.net/ ; I do not remember whether I have already mentioned it)? It is a kind of high-level scripting language that looks like XML and is aimed at specifying which type of harvesting task you want to perform. Assuming that you could call a WebHarvest script from RapidMiner, you could do exactly what you want...

@Ingo & Steffen :
Might a "scripting box" for WebHarvest be an interesting feature request?

Cheers,
Jean-Charles.

rajbanokhan

I have a question: how do I get data from the pages behind the hyperlinks on a website? That is, how can we fetch data by following the hyperlinks that appear on every web page?

from raj

Thomas_Ott

Did you try this example? http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480
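
Independent of the linked example, the general idea behind "getting data from behind the hyperlinks" can be sketched in plain Python: collect the link targets of a page, resolve them against the page's URL, and then fetch each one in turn. The snippet below covers only the link-collection step, uses the standard library, and works on an invented HTML fragment; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list[str]:
    """Return all link targets found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Invented sample page for illustration.
SAMPLE = '<p><a href="/post/1">One</a> <a href="http://example.org/x">X</a></p>'
links = extract_links(SAMPLE, "http://www.example.com/")
```

Each URL in `links` could then be downloaded and processed the same way, which is essentially what a crawler does when it follows links.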