UTF-8 encoded text doesn't come out right from the Get Page operator

User: "s_nektarijevic"
New Altair Community Member
Updated by Jocelyn
Dear RapidMiners,

I am having an issue with the Get Page operator and UTF-8 encoding.

I am scraping the content of this web page:


According to the html code I get out of Get Page, this page uses UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The problem is that, for example, "FDA’s" comes out with a garbled apostrophe.

I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:

"Encoding 'SYSTEM' is not supported"

Any idea how to solve this (without having to manually search and replace the unwanted characters, please)?

Many thanks in advance for any kind of input!

Snežana
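
The symptom described above is consistent with UTF-8 bytes being decoded with a single-byte Windows code page. Here is a minimal Python sketch of that mechanism, assuming the wrong decoder is Windows-1252 (the thread does not confirm which charset is actually applied). The helper names `garble` and `repair` are hypothetical and exist only for illustration.

```python
def garble(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse the mistake: recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "FDA\u2019s"  # U+2019 is the right single quotation mark
broken = garble(original)

# ascii() is used so the output is readable regardless of console encoding.
print(ascii(broken))               # the one apostrophe has become three characters
print(repair(broken) == original)  # the round trip restores the text
```

The repair step only works when the wrong decoding was lossless, so fixing the encoding at the source is usually preferable to patching the text afterwards.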

    User: "kayman"
    Updated by kayman
    Is your process itself also using UTF-8?
    When you click into your main window, you can also define the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under Settings -> Preferences -> General -> Encoding.
    User: "s_nektarijevic"
    OP
    Dear @kayman ,

    Many thanks for your suggestion! However, it didn't really help resolve my case :-(

    I am not sure whether I am doing things right, but I adjusted the settings as you suggested, reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding setting I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same result.

    In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:

    After Get Page:
    CVM GFI #108 Registering with CVM’€™€™\200\203€™s Electronic Submission System

    In the final Example Set:
    Registering with CVM’s Electronic Submission System

    Any idea what is still wrong?

    Many thanks in advance for any kind of input!

    Snezana




    User: "Marco_Boeck"
    Accepted Answer
    Hi,

    This works for me:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
            <parameter key="url" value="https://www.fda.gov/RegulatoryInformation/Guidances/default.htm"/>
            <parameter key="random_user_agent" value="false"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="none"/>
            <parameter key="cookie_scope" value="global"/>
            <parameter key="request_method" value="GET"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
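
The idea behind the operator-level `override_encoding` and `encoding` parameters can be sketched outside RapidMiner. The Python below is only an analogy under stated assumptions, not how the Get Page operator is implemented: `decode_page` is a hypothetical helper, and the "declared" charset stands in for whatever the fetching code would otherwise pick. It also shows that a placeholder such as `SYSTEM` is not itself a charset name that a decoder can look up.

```python
import codecs
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str], override: Optional[str] = None) -> str:
    """Decode raw page bytes; an explicit override wins over the declared charset."""
    charset = override or declared or "utf-8"
    codecs.lookup(charset)  # raises LookupError if the name is not a known charset
    return raw.decode(charset)


raw = "CVM\u2019s Electronic Submission System".encode("utf-8")

# With a wrong charset and no override, the apostrophe is garbled.
print(ascii(decode_page(raw, declared="cp1252")))

# Forcing UTF-8 recovers the intended text.
print(ascii(decode_page(raw, declared="cp1252", override="UTF-8")))

# A placeholder such as "SYSTEM" is rejected before any decoding happens.
try:
    decode_page(raw, declared=None, override="SYSTEM")
except LookupError as err:
    print("rejected:", err)
```

Validating the charset name before decoding makes a bad setting fail loudly instead of silently producing garbled text.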