Hello,
I have been very impressed with RapidMiner ever since I came across this program. So far my experience with RapidMiner has been superb and everything moved along smoothly, until I got stuck on this problem.
I'm currently working on an assignment in which we have to analyze trends in the IT industry (e.g. which IT jobs are in high demand).
I have read through some of the posts on the forum but couldn't find an answer to my question. The problem is that I wasn't able to crawl all the IT-related job postings from job-recruiting websites such as www.jobstreet.com.sg.
With the parameters below, the crawl yields at most around 100 text files containing page source, but usually only about 60 of them are what I'm looking for, namely job postings (e.g.
http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J). The rest are search-result pages (e.g.
http://job-search.jobstreet.com.sg/singapore/job-opening.php?area=1&option=1&specialization=192&job-posted=0&src=19&sort=1&order=0&classified=1&job-source=64&src=19&srcr=2).
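To make the distinction concrete, here is a small Python sketch of the two URL shapes. The patterns are only my guess, inferred from the two example links above:

```python
import re

# Job-posting pages look like /jobs/<year>/<month>/<letter>/<nn>/<id>.htm,
# while search-result pages live under .../job-opening.php (my assumption,
# based on the two example URLs).
JOB_PAGE = re.compile(r"/jobs/\d{4}/\d{1,2}/[a-z]/\d+/\d+\.htm")
SEARCH_PAGE = re.compile(r"/job-opening\.php")

urls = [
    "http://www.jobstreet.com.sg/jobs/2011/7/a/20/2666559.htm?fr=J",
    "http://job-search.jobstreet.com.sg/singapore/job-opening.php?area=1",
]
for u in urls:
    if JOB_PAGE.search(u):
        kind = "job posting"
    elif SEARCH_PAGE.search(u):
        kind = "search result"
    else:
        kind = "other"
    print(u, "->", kind)
```

Only the first kind of page is useful for the assignment; the second kind is noise.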
This is how I set the parameters:
depth: 2
url: http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/
store_with_matching_content: .*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*
follow_link_with_matching_text: .*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*
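I suspect this rule is why the search-result pages get stored as well: the same pattern, written in Python's re syntax (assuming RapidMiner's regex matching behaves similarly), matches any page whose text contains one of the uppercase keywords, and the search-result pages mention them too:

```python
import re

# The store_with_matching_content pattern from my crawling rules,
# reproduced in Python's re syntax for illustration.
RULE = re.compile(
    r".*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*"
)

# A search-result page also contains these keywords, so it is stored too:
print(bool(RULE.match("Search results for COMPUTER jobs in Singapore")))  # True
# A genuine job posting matches as well:
print(bool(RULE.match("Senior NETWORK Engineer - job details")))          # True
```

So the content rule alone cannot tell the two kinds of pages apart, which matches what I'm seeing in the output directory.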
This is the process XML:
<process version="5.1.006">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
<process expanded="true" height="190" width="145">
<operator activated="true" class="web:crawl_web" compatibility="5.1.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="75">
<parameter key="url" value="http://job-search.jobstreet.com.sg/singapore/computer-information-technology-jobs/"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_text" value=".*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*"/>
<parameter key="store_with_matching_content" value=".*(IT|NETWORK|COMPUTER|APPLICATION|ANALYST|SOFTWARE|DATABASE|HARDWARE).*"/>
</list>
<parameter key="output_dir" value="C:\Users\student\Desktop\CRAWLED RESULT"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
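In case it helps clarify what I'm after, here is a sketch of the post-filtering I'd like to achieve over the output directory. The directory name comes from the process above; the URL pattern and the assumption that a stored job page's source contains such a link are guesses on my part:

```python
import re
from pathlib import Path

# Output directory from the Crawl Web operator above.
OUTPUT_DIR = Path(r"C:\Users\student\Desktop\CRAWLED RESULT")

# Guessed shape of a job-posting URL, inferred from the example link
# earlier in this post.
JOB_URL = re.compile(r"/jobs/\d{4}/\d{1,2}/[a-z]/\d+/\d+\.htm")

def looks_like_job_page(text: str) -> bool:
    """Keep a stored page only if its source references a job-posting URL."""
    return bool(JOB_URL.search(text))

pages = list(OUTPUT_DIR.glob("*.txt"))
kept = [p for p in pages if looks_like_job_page(p.read_text(errors="ignore"))]
print(f"kept {len(kept)} of {len(pages)} stored pages")
```

Ideally, though, the crawler itself would only store the job pages in the first place, rather than me filtering afterwards.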
It would be awesome if anyone could share their knowledge, give us a few tips, or provide a step-by-step guide on how to obtain the information we want. I would very much appreciate any help.
Thanks,