'Crawl Web' not following certain links

User: "kludikovsky"
New Altair Community Member
Updated by Jocelyn

I am new to RM and am trying to explore its capabilities.

 

When I specify the links to follow, the required links are not followed as expected.

I have finally removed all 'limitations', and the links are still not followed.

One conclusion was that relative URLs are not handled properly. But this proved wrong in a test on the site http://www.formel1.de/rennergebnisse/2016/grosser-preis-von-deutschland/rennen with 'rennergebnisse
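To make the relative-URL point concrete, here is a minimal Python sketch of how a relative link is normally resolved against the page it appears on. It only illustrates standard URL resolution; whether Crawl Web resolves links the same way internally is an assumption. The page URL is the one from the test above, and the relative hrefs are hypothetical examples, not links taken from that page.

```python
from urllib.parse import urljoin

# Page URL from the test above; the relative hrefs used below are hypothetical.
BASE = "http://www.formel1.de/rennergebnisse/2016/grosser-preis-von-deutschland/rennen"


def resolve(base: str, href: str) -> str:
    """Resolve an href found on the page `base` into an absolute URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical hrefs: sibling page, root-relative path, parent-relative path.
    for href in ("qualifying", "/rennergebnisse/2015", "../grosser-preis-von-ungarn/rennen"):
        print(href, "->", resolve(BASE, href))
```

If the resolved URLs look as expected, relative links as such are unlikely to be the cause.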

I use this simple process:

 

 

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="7.2.000" expanded="true" height="68" name="Crawl Web (2)" width="90" x="45" y="34">
        <parameter key="url" value="https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&amp;StandortName=Innsbruck+Land&amp;Branche=3852&amp;BranchenName=Industrie&amp;CategoryID=0"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".*"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="add_pages_as_attribute" value="true"/>
        <parameter key="output_dir" value="C:\Users\Administrator\Documents\traRM"/>
        <parameter key="extension" value="txt"/>
        <parameter key="max_pages" value="5"/>
        <parameter key="max_depth" value="999"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="1000"/>
        <parameter key="max_threads" value="1"/>
        <parameter key="max_page_size" value="99000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <operator activated="false" class="store" compatibility="7.2.000" expanded="true" height="68" name="Store" width="90" x="514" y="34">
        <parameter key="repository_entry" value="../data/WKO_Test"/>
      </operator>
      <connect from_op="Crawl Web (2)" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The result is just one page and the log shows:

 

 

Aug 4, 2016 10:51:32 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only starts
Aug 4, 2016 10:51:34 AM INFO: Storing page https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&StandortName=Innsbruck+Land&Branche=3852&BranchenName=Industrie&CategoryID=0
Aug 4, 2016 10:51:34 AM INFO: Saving results.
Aug 4, 2016 10:51:34 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only finished successfully after 2 s

If I change the domain parameter from 'server' to 'web', links to other sites are followed, but the links within this site are still not.
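One way to narrow this down outside RapidMiner is to list the hrefs in the saved page and check each against the two filters used in the process: the `.*` rule and a same-server restriction. The Python sketch below does that. It is a diagnostic aid only: the `classify` helper, the `SAMPLE` HTML and its links are hypothetical, the host comparison is an assumed reading of what `domain = server` means, and Python's `re` stands in for whatever regex engine Crawl Web uses.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Start URL and rule taken from the process above.
START = ("https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123"
         "&StandortName=Innsbruck+Land&Branche=3852&BranchenName=Industrie&CategoryID=0")
RULE = ".*"  # value of follow_link_with_matching_url

# Hypothetical page content, for illustration only (not the real page).
SAMPLE = ('<a href="Detail.aspx?FirmaID=1">A</a>'
          '<a href="https://www.example.com/">B</a>'
          '<a href="javascript:void(0)">C</a>')


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify(start, html, rule=RULE):
    """Return (href, fetchable, same_server, matches_rule) for each link."""
    parser = HrefCollector()
    parser.feed(html)
    start_host = urlsplit(start).hostname
    rows = []
    for href in parser.hrefs:
        absolute = urljoin(start, href)
        parts = urlsplit(absolute)
        fetchable = parts.scheme in ("http", "https")
        # Assumed meaning of domain=server: same host as the start URL.
        same_server = fetchable and parts.hostname == start_host
        matches = re.fullmatch(rule, absolute) is not None
        rows.append((href, fetchable, same_server, matches))
    return rows


if __name__ == "__main__":
    for row in classify(START, SAMPLE):
        print(row)
```

Feeding the page text stored by Crawl Web into `classify` in place of `SAMPLE` shows whether the in-site links are plain http(s) URLs on the same host. If they are not (for example, script-driven links), that would explain why a crawler that only follows hrefs skips them.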

 

What am I doing wrong?

 

    User: "Thomas_Ott"
    New Altair Community Member

    Hi kludikovsky,

     

    Are you trying to crawl all the links on the site?