"Complex Web Crawling Process With Sessions"
de_chris
New Altair Community Member
Hi there,
I'm trying to crawl job postings from this site: http://jobboerse.arbeitsagentur.de/vamJB/startseite.html
I need to get the apprenticeship postings for a certain region. Unfortunately, the whole site relies on sessions. So far I'm using the Get Pages operator to fetch the list of all apprenticeship postings for that region, which is spread across 7 pages with about 50 postings per page.
At the moment I'm trying to extract all the posting links from each of the 7 pages and then request each posting's detail page. My guess is that the Extract Information operator can get the links, but I'm still trying to figure out the correct XPath queries to get the 50 posting detail links. I already get the first posting detail link, but I need some kind of iteration or enumeration for the rest. Any ideas?
Also, this process is going to be very complex. Any hint on how to make it simpler is welcome. The problem is that I can only request one URL per Get Pages operator if I want to keep the session.
Thanks in advance
<code>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="5.3.005" expanded="true" height="60" name="Read CSV" width="90" x="380" y="165">
<parameter key="csv_file" value="links.txt"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.text.attribute"/>
</list>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (2)" width="90" x="514" y="165">
<parameter key="link_attribute" value="att1"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0"/>
<parameter key="connection_timeout" value="10000000"/>
<parameter key="read_timeout" value="10000000"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="delay" value="fixed"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="5.3.005" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="255">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="URL"/>
</operator>
<operator activated="true" class="rename" compatibility="5.3.005" expanded="true" height="76" name="Rename" width="90" x="715" y="255">
<parameter key="old_name" value="URL"/>
<parameter key="new_name" value="URL1"/>
<list key="rename_additional_attributes"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes" width="90" x="849" y="210">
<list key="function_descriptions">
<parameter key="URL2" value="concat(URL1,&quot;&amp;d_6827794_z=50&quot;)"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.005" expanded="true" height="112" name="Multiply" width="90" x="983" y="210"/>
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes (3)" width="90" x="1184" y="390">
<list key="function_descriptions">
<parameter key="URL4" value="replace(URL2,&quot;d_6827794_p=1&quot;,&quot;d_6827794_p=3&quot;)"/>
</list>
</operator>
<operator activated="true" class="generate_attributes" compatibility="5.3.005" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="1184" y="255">
<list key="function_descriptions">
<parameter key="URL3" value="replace(URL2,&quot;d_6827794_p=1&quot;,&quot;d_6827794_p=2&quot;)"/>
</list>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (5)" width="90" x="1385" y="390">
<parameter key="link_attribute" value="URL4"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0"/>
<parameter key="connection_timeout" value="10000000"/>
<parameter key="read_timeout" value="10000000"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="delay" value="fixed"/>
</operator>
<operator activated="true" class="text:extract_document" compatibility="5.3.000" expanded="true" height="76" name="Extract Document" width="90" x="1519" y="390">
<parameter key="attribute_name" value="gensym2"/>
<parameter key="example_index" value="1"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="1653" y="390">
<parameter key="keep_text" value="true"/>
<process expanded="true">
<operator activated="true" class="text:extract_information" compatibility="5.3.000" expanded="true" height="60" name="Extract Information (2)" width="90" x="313" y="30">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="o1" value="//a[@title='Zu den Details des Stellenangebots']"/>
<parameter key="o3" value="/*[name()='html']/*[name()='body']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='form']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='table']/*[name()='tbody']/*[name()='tr']/*[name()='td']/*[name()='div']/*[name()='a' and @title='Zu den Details des Stellenangebots']"/>
<parameter key="o4" value="/*[name()='html']/*[name()='body']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='form']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='table']/*[name()='tbody']/*[name()='tr']/*[name()='td']/*[name()='div']/*[name()='a' and @title='Zu den Details des Stellenangebots' and 4]"/>
<parameter key="o5" value="/*[name()='html']/*[name()='body']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='form']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='table']/*[name()='tbody']/*[name()='tr']/*[name()='td']/*[name()='div']/[2][name()='a' and @title='Zu den Details des Stellenangebots']"/>
<parameter key="o6" value="/*[name()='html']/*[name()='body']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='form']/*[name()='div']/*[name()='div']/*[name()='div']/*[name()='table']/*[name()='tbody']/*[name()='tr']/*[name()='td']/*[name()='div']/*[name()='a' and @title='Zu den Details des Stellenangebots']"/>
</list>
<list key="namespaces"/>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information (2)" to_port="document"/>
<connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (4)" width="90" x="1318" y="255">
<parameter key="link_attribute" value="URL3"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0"/>
<parameter key="connection_timeout" value="10000000"/>
<parameter key="read_timeout" value="10000000"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="delay" value="fixed"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="5.3.000" expanded="true" height="60" name="Get Pages (3)" width="90" x="1117" y="120">
<parameter key="link_attribute" value="URL2"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0"/>
<parameter key="connection_timeout" value="10000000"/>
<parameter key="read_timeout" value="10000000"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="delay" value="fixed"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Get Pages (2)" to_port="Example Set"/>
<connect from_op="Get Pages (2)" from_port="Example Set" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Get Pages (3)" to_port="Example Set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 3" to_op="Generate Attributes (3)" to_port="example set input"/>
<connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Get Pages (5)" to_port="Example Set"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Get Pages (4)" to_port="Example Set"/>
<connect from_op="Get Pages (5)" from_port="Example Set" to_op="Extract Document" to_port="example set"/>
<connect from_op="Extract Document" from_port="document" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 2"/>
<connect from_op="Get Pages (4)" from_port="Example Set" to_port="result 3"/>
<connect from_op="Get Pages (3)" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
</code>
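The link-enumeration problem described above can be checked outside RapidMiner with a small Python sketch. It is only an illustration under stated assumptions: `SAMPLE_PAGE` and its `href` values are hypothetical stand-ins for one result page, reduced to well-formed XHTML without a namespace, and `extract_detail_links` is a made-up helper name. One XPath 1.0 detail is relevant here: a positional predicate on the last step, such as `a[2]`, counts among the children of the same parent, so when every link sits in its own `div` only position 1 ever matches. In full XPath, `(//a[@title='Zu den Details des Stellenangebots'])[2]` indexes the whole result set instead. The standard-library parser used below does not support that parenthesized form, so the sketch selects all matching anchors and enumerates them as a list.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, heavily simplified stand-in for one result page.
# The real page is larger and may carry an XHTML namespace.
SAMPLE_PAGE = """<html><body><table><tbody>
<tr><td><div><a title="Zu den Details des Stellenangebots" href="/detail?id=1">Job 1</a></div></td></tr>
<tr><td><div><a title="Zu den Details des Stellenangebots" href="/detail?id=2">Job 2</a></div></td></tr>
<tr><td><div><a title="Hilfe" href="/help">Help</a></div></td></tr>
</tbody></table></body></html>"""

# The title attribute used in the "o1" query of the process above.
LINK_TITLE = "Zu den Details des Stellenangebots"


def extract_detail_links(xhtml: str) -> List[str]:
    """Return the href of every anchor whose title marks a posting detail link."""
    root = ET.fromstring(xhtml)
    # ".//a[@title='...']" matches all such anchors at any depth,
    # not just the first one.
    anchors = root.findall(f".//a[@title='{LINK_TITLE}']")
    return [a.get("href") for a in anchors if a.get("href")]


if __name__ == "__main__":
    # Enumerate the links; the position comes from the list, not from XPath.
    for position, href in enumerate(extract_detail_links(SAMPLE_PAGE), start=1):
        print(position, href)
```

The same idea should carry over to any tool that returns the full node set for the short `//a[@title=...]` query; the long absolute paths in `o3` to `o6` are not needed to identify the links.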
Answers
Typically, a session is managed with a cookie stored in your web browser, so with the correct settings you should be able to use the same session for all of your crawl operators, because they support cookies. At least the "Get Page" and "Get Pages" operators have a parameter called "accept_cookies". Set this to "all" and the new parameter "cookie_scope" to "global" for every "Get ..." operator you use.
Now all operators should use the same cookies, including the session cookies.
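The idea behind a shared cookie scope can be illustrated outside RapidMiner with a minimal Python sketch. It does not show how the operators are implemented; it only demonstrates the general mechanism of one cookie store reused across several requests. The helper names `build_session_opener` and `fetch` are hypothetical, and the timeout value is an arbitrary choice.

```python
import urllib.request
from http.cookiejar import CookieJar
from typing import Tuple


def build_session_opener(user_agent: str) -> Tuple[urllib.request.OpenerDirector, CookieJar]:
    """Create one opener whose cookie jar is shared by every request made with it."""
    jar = CookieJar()
    # HTTPCookieProcessor stores cookies from responses in the jar and
    # sends them back on later requests, which keeps the session alive.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch(opener: urllib.request.OpenerDirector, url: str, timeout: float = 30.0) -> str:
    """Fetch a page through the shared opener and return its decoded text."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With this setup, calling `fetch` first on the start page and then on each result page through the same `opener` sends the session cookie received from the first response along with all later requests, which is roughly what a global cookie scope is meant to achieve.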
Thanks for the reply. The problem must have been the server.