
Network connection error with the Get Pages operator

User: "miner"
New Altair Community Member
Updated by Jocelyn

Hi there,

For a test, I created a list of 9 URLs in an Excel sheet.

Now I'm trying to test the following process:

Read Excel > Get Pages > Data to Documents > further processing...

When I set a breakpoint after Read Excel, I get an example set of the 9 URLs.

As soon as Get Pages runs, the process fails with the error "Could not connect to the specified URL. Please check your network connection."

Here is my process:

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="read_excel" compatibility="7.5.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="C:\Users\xxx\Desktop\Crawler\test_url.xls"/>
        <parameter key="imported_cell_range" value="A1:A9"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="locale" value="German (Germany)"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="A.true.file_path.label"/>
        </list>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="179" y="136">
        <parameter key="link_attribute" value="A"/>
        <parameter key="page_attribute" value="*"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
        <parameter key="connection_timeout" value="100000"/>
        <parameter key="read_timeout" value="100000"/>
        <parameter key="accept_cookies" value="all"/>
        <parameter key="delay" value="random"/>
        <parameter key="min_delay_amount" value="200"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="313" y="34">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="85">
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <operator activated="true" class="text:filter_stopwords_german" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="45" y="136"/>
          <operator activated="true" class="text:stem_snowball" compatibility="7.5.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="136"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="238"/>
          <operator activated="true" class="text:generate_n_grams_characters" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Characters)" width="90" x="179" y="238">
            <parameter key="keep_terms" value="true"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
          <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Generate n-Grams (Characters)" to_port="document"/>
          <connect from_op="Generate n-Grams (Characters)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

My network connection works fine. I tested the same setup with the Crawl Web operator, and that works fine.

I already increased the connection timeout and read timeout parameters, but with no effect.
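
One check that could be run outside RapidMiner is a sanity pass over the URL column, on the assumption (not confirmed) that the error might come from malformed entries rather than from the network itself. The following is only a minimal Python sketch: the function name `find_malformed_urls` and the sample values are hypothetical and are not the real contents of test_url.xls. It uses only the standard library and makes no network requests.

```python
from urllib.parse import urlparse


def find_malformed_urls(urls):
    """Return (index, value, reason) for entries unlikely to be fetchable as-is."""
    problems = []
    for index, raw in enumerate(urls):
        # Treat non-string cells (e.g. None from an empty cell) as empty.
        value = raw.strip() if isinstance(raw, str) else ""
        if not value:
            problems.append((index, raw, "empty or not a string"))
            continue
        if value != raw:
            problems.append((index, raw, "leading or trailing whitespace"))
            continue
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https"):
            problems.append((index, raw, "missing http:// or https:// scheme"))
        elif not parsed.netloc:
            problems.append((index, raw, "missing host name"))
    return problems


if __name__ == "__main__":
    # Hypothetical sample values, not the real sheet contents.
    sample = ["https://example.com/", "www.example.org", " http://example.net", ""]
    for index, value, reason in find_malformed_urls(sample):
        print(f"row {index + 1}: {value!r} -> {reason}")
```

If every row passes this check, malformed URLs are probably not the cause, and the operator settings or the attribute metadata would be the next thing to look at.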

Any ideas what the reason for this error could be?

Thanks

miner
