Crawl Web user agent is always Java/1.7.0_11

n0n0
New Altair Community Member
Updated by Jocelyn
Hello to the community.

When I try to crawl a website, I always get the mobile version of the site.
I thought the user agent might be the cause, so I changed the user agent parameter in the Crawl Web operator, but I still get the same result.
I then tried to crawl the page http://whatsmyuseragent.com/ with the following parameters:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
        <parameter key="url" value="http://whatsmyuseragent.com/"/>
        <list key="crawling_rules"/>
        <parameter key="write_pages_into_files" value="true"/>
        <parameter key="add_pages_as_attribute" value="false"/>
        <parameter key="output_dir" value="C:\temp"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="0"/>
        <parameter key="domain" value="web"/>
        <parameter key="delay" value="1000"/>
        <parameter key="max_threads" value="1"/>
        <parameter key="max_page_size" value="100"/>
        <parameter key="user_agent" value="rapid-miner-crawler"/>
        <parameter key="obey_robot_exclusion" value="true"/>
        <parameter key="really_ignore_exclusion" value="false"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
The resulting C:\Temp\0.html file reports my user agent as Java/1.7.0_11, whatever I set in the user agent field.

I'm on a Windows 8 x64 machine, using RapidMiner 5.3.008 and Web Mining Extension 5.3.0.

Any advice?
Thank you
n0n0

    Hey,

    This seems to be a bug. The user agent setting works for the "Get Page" operator, but not for "Crawl Web" or "Process Documents from Web". I have created a ticket for this. We will come back to this thread once we have fixed it.

    Thank you for reporting
      Marcin
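
For readers wondering where a value like Java/1.7.0_11 comes from: it has the shape of the default User-Agent that Java's standard HttpURLConnection sends when no User-Agent header is set explicitly. The thread does not say how Crawl Web issues its requests, so the following is only a sketch of the plain-Java behavior, not of RapidMiner's internals. The class name UserAgentDemo and the local echo server are made up for illustration, and the code assumes Java 9 or later and that the http.agent system property is not set.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of RapidMiner or the Web Mining Extension.
public class UserAgentDemo {

    /** Starts a local server that replies with the User-Agent header it received. */
    public static HttpServer startEchoServer() throws IOException {
        // Port 0 lets the operating system pick a free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String agent = exchange.getRequestHeaders().getFirst("User-Agent");
            byte[] body = String.valueOf(agent).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Fetches the URL and returns the body; sets User-Agent only if userAgent is non-null. */
    public static String fetch(URL url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (userAgent != null) {
            // Without this call, the JDK falls back to its own default agent string.
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startEchoServer();
        try {
            URL url = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
            // No header set: the server sees the JDK default, "Java/" plus the runtime version.
            System.out.println("default:  " + fetch(url, null));
            // Header set explicitly: the server sees the configured value.
            System.out.println("explicit: " + fetch(url, "rapid-miner-crawler"));
        } finally {
            server.stop(0);
        }
    }
}
```

If an operator ignores its user agent parameter and never sets the header, a request made this way would show the first kind of value, which would be consistent with what whatsmyuseragent.com reported above.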