Problem with the extension operator "Get Pages"

jhiller
New Altair Community Member
Updated by Jocelyn

Hi,

 

I have a problem with the "Get Pages" operator from the Web Mining Extension.

It seems that the operator has an encoding problem with non-ASCII UTF-8 characters such as "Ü".

In Mozilla Firefox, calling the URL "https://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5" returns a JSON response with results.

Calling the same URL via the "Get Pages" operator returns a JSON response, but without any search results.

 

This is my test process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.5.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
<parameter key="target_function" value="random"/>
<parameter key="number_examples" value="1"/>
<parameter key="number_of_attributes" value="1"/>
<parameter key="attributes_lower_bound" value="-10.0"/>
<parameter key="attributes_upper_bound" value="10.0"/>
<parameter key="gaussian_standard_deviation" value="10.0"/>
<parameter key="largest_radius" value="10.0"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
<list key="function_descriptions">
<parameter key="att1" value="&quot;https://itunes.apple.com/search?term=\&quot;Google Übersetzer\&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5&quot;"/>
</list>
<parameter key="keep_all" value="true"/>
</operator>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPage" width="90" x="313" y="34">
<parameter key="link_attribute" value="att1"/>
<parameter key="page_attribute" value="html"/>
<parameter key="random_user_agent" value="false"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
<parameter key="connection_timeout" value="2000"/>
<parameter key="read_timeout" value="2000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="none"/>
<parameter key="cookie_scope" value="global"/>
<parameter key="request_method" value="POST"/>
<parameter key="delay" value="random"/>
<parameter key="delay_amount" value="5000"/>
<parameter key="min_delay_amount" value="2000"/>
<parameter key="max_delay_amount" value="5000"/>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="getPage" to_port="Example Set"/>
<connect from_op="getPage" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

Can you reproduce the issue? Do you think this is a bug in the operator, or do I have to escape the URL, and if so, how?

 

Regards

Johannes

    It gives me a Bad Request (400) if I just plug the URL into a single Get Page operator. I think Apple is preventing people like us from using their service. Maybe @Edin_Klapic has an idea about this.

    Hi Johannes,

     

    I tried your URL with various RapidMiner operators: Get Pages, Get Page, Enrich Data by Webservice, and Open File (from URL) in combination with Read Document.

    None of them delivered the desired output, but I can confirm that I got the same result you did.

     

    Regarding your encoding question:

    In your use case I tried to encode the part you mentioned, but this did not help:

    http://itunes.apple.com/search?term="Google Übersetzer"&entity=software&country=de&media=software&limit=5
    ==>

    When I load the URL in my browser, a .txt file is downloaded to my computer. I suspect that this is the problem.

    If you can try this with a website that returns only a JSON string, we should be able to get this going.

     

    Best regards,

    Edin

     

    jhiller
    New Altair Community Member
    OP

    Hi,

     

    Thanks a lot for your work!

    Sorry for the late response. There was a mistake in my process: the user agent must be randomized. The following process shows my problem better.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
    <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Google Übersetzer&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
    <parameter key="add label" value="false"/>
    <parameter key="label_type" value="nominal"/>
    </operator>
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document (2)" width="90" x="45" y="136">
    <parameter key="text" value="https://itunes.apple.com/search?term=&quot;Whatsapp&quot;&amp;entity=software&amp;country=de&amp;media=software&amp;limit=5"/>
    <parameter key="add label" value="false"/>
    <parameter key="label_type" value="nominal"/>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="103" name="Documents to Data" width="90" x="246" y="34">
    <parameter key="text_attribute" value="att1"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    </operator>
    <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="getPages" width="90" x="447" y="34">
    <parameter key="link_attribute" value="att1"/>
    <parameter key="page_attribute" value="html"/>
    <parameter key="random_user_agent" value="true"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"/>
    <parameter key="connection_timeout" value="2000"/>
    <parameter key="read_timeout" value="2000"/>
    <parameter key="follow_redirects" value="true"/>
    <parameter key="accept_cookies" value="none"/>
    <parameter key="cookie_scope" value="global"/>
    <parameter key="request_method" value="POST"/>
    <parameter key="delay" value="random"/>
    <parameter key="delay_amount" value="5000"/>
    <parameter key="min_delay_amount" value="2000"/>
    <parameter key="max_delay_amount" value="5000"/>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="getPages" to_port="Example Set"/>
    <connect from_op="getPages" from_port="Example Set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    You can see that the process works for the second row, but not for the first row, which contains the special character. So I still think that this is an encoding problem in the implementation of the "Get Pages" operator.

     

    Best Regards,

    Johannes

    Edin_Klapic
    New Altair Community Member
    Accepted Answer

    The link needs to be encoded as follows:

    https://itunes.apple.com/search?term="Google+%C3%9Cbersetzer"&entity=software&country=de&media=software&limit=5

    My first suggestion, %DC as the encoding for the letter Ü, is only partly correct: for UTF-8 it needs to be %C3%9C.
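
For anyone who wants to produce such links outside of RapidMiner, here is a minimal Python sketch using only the standard library. It is not what the "Get Pages" operator does internally; the helper name `encode_search_url` is made up for illustration, and keeping the double quotes unescaped is an assumption taken from the encoded link above. It also shows where %DC comes from: that is the ISO-8859-1 (Latin-1) byte for Ü, while UTF-8 needs two bytes.

```python
from urllib.parse import quote, urlencode


def encode_search_url(term: str) -> str:
    """Hypothetical helper: build the iTunes search URL from this thread
    with a UTF-8 percent-encoded query string."""
    params = {
        "term": f'"{term}"',  # keep the literal quotes used in the thread
        "entity": "software",
        "country": "de",
        "media": "software",
        "limit": "5",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default
    # and turns spaces into '+'. safe='"' leaves the double quotes as-is,
    # matching the encoded link above (an assumption, not a requirement).
    return "https://itunes.apple.com/search?" + urlencode(params, safe='"')


if __name__ == "__main__":
    print(encode_search_url("Google Übersetzer"))
    print(quote("Ü"))                      # UTF-8 bytes of Ü
    print(quote("Ü", encoding="latin-1"))  # Latin-1 byte of Ü
```

The resulting string can then be used as the value of the link attribute (att1 in the process above) instead of the raw URL.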

     

    You can test URL encoding on various websites (e.g. here).

     

    Best,

    Edin

    jhiller
    New Altair Community Member
    OP

    Thanks a lot. The solution is working!