"WEB crawler rules"
Hi!
I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm completely fresh in RapidMiner.
So I've started to use the Web Crawler processor.
I'm using it to crawl a Slovenian real estate website, and I'm having trouble setting the web crawler rules.
I know there are two important rules: which URLs to follow and which to store.
I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html plus id=something.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280
What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+
Any help would be appreciated!
U.
Answers
Hey U,
on a quick check I got some pages with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+
And I also increased the max page size to 10000.
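In operator XML that would look roughly like this (just a sketch, untested; the parameter keys are those of the Crawl Web operator, and output_dir is a placeholder you'd have to adapt):

  <operator activated="true" class="web:crawl_web" name="Crawl Web">
    <parameter key="url" value="http://www.realestate-slovenia.info/"/>
    <list key="crawling_rules">
      <!-- follow and store every URL containing "id" -->
      <parameter key="follow_link_with_matching_url" value=".+id.+"/>
      <parameter key="store_with_matching_url" value=".+id.+"/>
    </list>
    <parameter key="output_dir" value="C:\crawl_output"/>
    <parameter key="max_page_size" value="10000"/>
  </operator>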
As always I have to ask: did you check that the site's policy/copyright notice allows you to machine-crawl that page?
Best regards,
Marius
Marius,
the web page allows robots.
Your example stores only the real estate ads on the first page; the web crawler doesn't go to the second, third, ... page.
Thanks for helping.
Then you probably have to increase the max_depth and adapt your rules. Please note that you should not add more than one follow rule, but instead add all expressions to one single rule, separated by a vertical bar as you have done in your first post.
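For example, a single follow rule covering both the paging links and the ad links could look like this in the crawling_rules list (a sketch reusing the expressions from your first post):

  <list key="crawling_rules">
    <!-- one follow rule; expressions combined with a vertical bar -->
    <parameter key="follow_link_with_matching_url" value=".+pg.+|.+id.+"/>
    <!-- store only the ad pages -->
    <parameter key="store_with_matching_url" value=".+id.+"/>
  </list>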
Best regards,
Marius
Marius,
I put the web crawler problem aside for a while; today I started working on it again. I still have a problem with the crawling rules. All the other web crawler attributes are clear.
This is my Web crawler process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
<parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&]pg=.+ | id=.+)"/>
<parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
</list>
<parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
<parameter key="extension" value="html"/>
<parameter key="max_depth" value="4"/>
<parameter key="domain" value="server"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
As you can see, I try to follow three types of URLs, for example:
http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923
And I want to store only one type of URL:
http://www.realestate-slovenia.info/nepremicnine.html?id=5469846
So for the first task my rule is:
http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)
For the second task the rule is:
http://www.realestate-slovenia.info/nepremicnine.html?id=.+
The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
.+pg.+|.+id.+ for the first task and .+id.+ for the second task, but the latter returns many pages that are not my focus.
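One thing I'm starting to suspect (just a guess, untested): ? and . are special characters in regular expressions, so maybe the full-URL rules only match if those characters are escaped, something like:

  <list key="crawling_rules">
    <!-- \? and \. match the literal characters; [&] matches a literal ampersand -->
    <parameter key="follow_link_with_matching_url" value="http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale[&]pg=.+|id=.+)"/>
    <parameter key="store_with_matching_url" value="http://www\.realestate-slovenia\.info/nepremicnine\.html\?id=.+"/>
  </list>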
I would really like this process to work because the gathered data are the basis for my article.
Thanks.