"Text Mining-Crawler problem"
sijusony
New Altair Community Member
Hi everyone,
I am facing a problem while using the crawler. I tried the following process:
<operator name="Root" class="Process" expanded="yes">
<parameter key="logfile" value="C:\Documents and Settings\284561\Desktop\rapid\logfile.log"/>
<parameter key="resultfile" value="C:\Documents and Settings\284561\Desktop\rapid\result.res"/>
<operator name="Crawler" class="Crawler">
<list key="crawling_rules">
<parameter key="follow_url" value="spreeblick"/>
<parameter key="visit_content" value="google"/>
</list>
<parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
<parameter key="url" value="http://www.spreeblick.com/"/>
</operator>
</operator>
If I run this, I get the message that the process finished successfully, but I cannot see any HTML pages in the specified output directory. Can anyone tell me what the problem is? My log file is below.
P Dec 15, 2008 2:01:44 PM: Logging: log file is 'logfile.log'...
P Dec 15, 2008 2:01:44 PM: Initialising process setup
P Dec 15, 2008 2:01:44 PM: Checking properties...
P Dec 15, 2008 2:01:44 PM: Properties are ok.
P Dec 15, 2008 2:01:44 PM: Checking process setup...
P Dec 15, 2008 2:01:44 PM: Inner operators are ok.
P Dec 15, 2008 2:01:44 PM: Checking i/o classes...
P Dec 15, 2008 2:01:44 PM: i/o classes are ok. Process output: ExampleSet.
P Dec 15, 2008 2:01:44 PM: Process ok.
P Dec 15, 2008 2:01:44 PM: Process initialised
P Dec 15, 2008 2:01:44 PM: [NOTE] Process starts
P Dec 15, 2008 2:01:44 PM: Process:
Root[1] (Process)
+- Crawler[1] (Crawler)
Last message repeated 1 times.
P Dec 15, 2008 2:02:05 PM: Produced output:
IOContainer (2 objects):
SimpleExampleSet:
0 examples,
2 regular attributes,
no special attributes
(created by Crawler)
com.rapidminer.operator.crawler.LinkMatrix@13ddd13
(created by Crawler)
P Dec 15, 2008 2:02:05 PM: [NOTE] Process finished successfully after 21 seconds
Answers
Hi,
probably your crawling rules forbid storing any of the pages found. The parameters have the following meaning (for more information see http://nemoz.org/joomla/content/view/64/53/lang,de/):
The following condition types are supported to specify which links to follow:
follow_url: A link is only followed if the target URL contains all terms stated in this parameter.
link_text: A link is only followed if the link text contains all terms stated in this parameter.
The conditions that state whether to store a page or not allow for the following expressions:
visit_url: A page is only stored if its URL contains all terms stated in this parameter.
visit_content: A page is only stored if its content contains all terms stated in this parameter.
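In your process, visit_content is set to "google", so a page is only stored if its content contains the term "google". Most pages on spreeblick.com probably do not contain it, which would explain why the process finishes successfully but the example set stays empty and no HTML files are written. A minimal variant that stores every visited page could look like this (just a sketch, assuming the rule keys behave as described above):
<operator name="Crawler" class="Crawler">
    <list key="crawling_rules">
        <!-- follow only links whose URL contains "spreeblick" -->
        <parameter key="follow_url" value="spreeblick"/>
        <!-- store every page whose URL contains "spreeblick", i.e. every visited page -->
        <parameter key="visit_url" value="spreeblick"/>
    </list>
    <parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
    <parameter key="url" value="http://www.spreeblick.com/"/>
</operator>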
Greetings,
Sebastian
Hi Sebastian,
I tried the crawler on an intranet site and it works fine, but when I try to crawl internet sites it gives me problems. The user agent I am using is rapid-miner-crawler. For accessing internet sites, do I have to use any other user agent?
Thank you for your quick reply.
Greetings,
Siju Sony Mathew
Hi,
perhaps they forbid this type of user agent for their site, or have even excluded crawlers in their robots.txt.
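You can check this by requesting robots.txt from the site root, e.g. http://www.spreeblick.com/robots.txt. An exclusion that bans every crawler from the whole site would look like this (illustrative example only, not the site's actual file):
User-agent: *
Disallow: /
A crawler that obeys robot exclusion will then refuse to fetch or store any page from that host.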
Greetings,
Sebastian
Hi,
is there any other user agent with which the crawler can access the web pages?
Greetings,
Siju
Hi,
the user_agent parameter of the crawler specifies the string the client uses to identify itself to the HTTP server. You can put in arbitrary values, for example the strings used by Internet Explorer, Firefox or another browser. If it is your own web page, you could even turn off "obey_robot_exclusion", causing the crawler to ignore bans within the robots.txt. But do this only if it is your own page!
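For example, a crawler presenting itself as a normal browser could be configured like this (a sketch only; the user agent string is just an illustrative value):
<operator name="Crawler" class="Crawler">
    <!-- identify the crawler to the server as a browser; this value is only an example -->
    <parameter key="user_agent" value="Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)"/>
    <!-- ignore robots.txt bans; only do this on your own pages! -->
    <parameter key="obey_robot_exclusion" value="false"/>
    <parameter key="url" value="http://www.spreeblick.com/"/>
</operator>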
Greetings,
Sebastian