So I decided to dig a bit deeper into RapidMiner and set myself a first challenge:
I want the crawler to fetch every post of a blog that mentions a certain word.
I first tried the wizard, but it seems to expect an existing database or file to work with.
So instead I took the "naked" Root Process, added the Crawler operator,
and configured the rules like this:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomlogfile.log"/>
<parameter key="resultfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomresultfile.res"/>
<operator name="Crawler" class="Crawler">
<list key="crawling_rules">
<parameter key="follow_url" value="spreeblick"/>
<parameter key="visit_content" value="google"/>
</list>
<parameter key="obey_robot_exclusion" value="false"/>
<parameter key="output_dir" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\nsv"/>
<parameter key="url" value="http://www.spreeblick.com/"/>
</operator>
</operator>
So the crawler should go to spreeblick.com, follow only URLs that contain the string "spreeblick",
and save only those pages that contain the string "google" somewhere.
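To make the intent concrete, here is a minimal Python sketch of what I expect the two rules to do. This is not RapidMiner code: the function names are hypothetical, and I am assuming each rule value is treated as a case-sensitive pattern that may match anywhere in the URL or page text.

```python
import re

# Hypothetical stand-ins for the two crawling rules in the process above.
# Assumption: each rule value is a pattern that may match anywhere.
FOLLOW_URL = re.compile(r"spreeblick")   # rule "follow_url"
VISIT_CONTENT = re.compile(r"google")    # rule "visit_content"


def should_follow(url: str) -> bool:
    """Follow a link only if its URL contains the follow_url pattern."""
    return FOLLOW_URL.search(url) is not None


def should_save(page_text: str) -> bool:
    """Save a page only if its content contains the visit_content pattern."""
    return VISIT_CONTENT.search(page_text) is not None


def filter_pages(pages: dict[str, str]) -> list[str]:
    """Return the URLs that pass both rules, in input order."""
    return [
        url
        for url, text in pages.items()
        if should_follow(url) and should_save(text)
    ]
```

Given a mapping of URL to page text, `filter_pages` would keep only the spreeblick pages that mention "google", which is the result I am hoping the crawler writes to the output directory.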
Now, the funny thing is that it does start crawling, but ONLY if "obey_robot_exclusion" is
enabled. If I disable it, I get a "Process failed, RuntimeException caught: JOptionPane: parentComponent does
not have a valid parent." error.
Just to make sure before I go any further: what am I doing wrong to get this strange robot exclusion error?