Crawl Web user agent is always Java/1.7.0_11

n0n0 New Altair Community Member
edited November 2024 in Community Q&A
Hello to the community.

When I try to crawl a website, I always get the mobile version of the site.
I suspected a problem with the user agent, so I changed the user agent parameter of the Crawl Web operator, but I still get the same result.
I then tried to crawl the page http://whatsmyuseragent.com/ with the following process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://whatsmyuseragent.com/"/>
        <list key="crawling_rules"/>
        <parameter key="write_pages_into_files" value="true"/>
        <parameter key="add_pages_as_attribute" value="false"/>
        <parameter key="output_dir" value="C:\temp"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="0"/>
        <parameter key="domain" value="web"/>
        <parameter key="delay" value="1000"/>
        <parameter key="max_threads" value="1"/>
        <parameter key="max_page_size" value="100"/>
        <parameter key="user_agent" value="rapid-miner-crawler"/>
        <parameter key="obey_robot_exclusion" value="true"/>
        <parameter key="really_ignore_exclusion" value="false"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
The resulting C:\Temp\0.html file reports my user agent as Java/1.7.0_11, whatever I set in the user agent field.

I'm on a Windows 8 x64 machine, using RapidMiner 5.3.008 and Web Mining Extension 5.3.0

Any advice?
Thank you
n0n0

Answers

  • Skirzynski
    Skirzynski New Altair Community Member
    Hey,

    This seems to be a bug. The user agent setting works for the "Get Page" operator, but not for "Crawl Web" or "Process Documents from Web". I have created a ticket for this. We will come back to this thread once we have fixed it.

    Thank you for reporting
      Marcin
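
For readers wondering where the reported string comes from: "Java/1.7.0_11" has the form of the default User-Agent that the JDK's built-in HTTP client sends when no header is set explicitly. The following is only a minimal sketch of that behaviour, assuming the crawler goes through the standard java.net classes, which this thread does not confirm. The class and method names (UserAgentDemo, defaultJavaUserAgent, openWithUserAgent) are hypothetical and are not part of RapidMiner or the Web Mining Extension.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical demo class; not part of RapidMiner or the Web Mining Extension.
class UserAgentDemo {

    // The JDK's built-in HTTP client identifies itself as "Java/<java.version>"
    // when no User-Agent header has been set on the request.
    static String defaultJavaUserAgent() {
        return "Java/" + System.getProperty("java.version");
    }

    // Prepares (but does not send) a request with an explicit User-Agent header.
    static URLConnection openWithUserAgent(String url, String userAgent) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // No network traffic happens here; the connection is only configured.
        URLConnection connection =
                openWithUserAgent("http://whatsmyuseragent.com/", "rapid-miner-crawler");
        System.out.println("Default if unset: " + defaultJavaUserAgent());
        System.out.println("Header to be sent: " + connection.getRequestProperty("User-Agent"));
    }
}
```

If the header is never applied to the connection, the server sees the default value, which would match the symptom described above. The JDK also reads the http.agent system property, but it appends "Java/<version>" to whatever value is given there, and whether setting it changes what Crawl Web sends in RapidMiner 5.3 is untested here.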
