UTF-8 encoded text doesn't come out right from the Get Page operator

s_nektarijevic · December 2018

Dear RapidMiners,

I am having an issue with the Get Page operator and UTF-8 encoding.

I am scraping the content of this web page:

https://www.fda.gov/RegulatoryInformation/Guidances/default.htm

According to the HTML code I get out of Get Page, this page uses UTF-8:

The problem is that, for example, "FDA’s" comes out as "FDAâs".
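For readers following along: this symptom is consistent with UTF-8 bytes being decoded with a single-byte charset. The following is a minimal Python sketch, outside RapidMiner and purely illustrative. It assumes the apostrophe on the page is U+2019 (right single quotation mark); whether Get Page does exactly this internally is not confirmed by anything in this thread.

```python
# Illustration only: reproduces the symptom outside RapidMiner.
# Assumption: the apostrophe on the page is U+2019 (RIGHT SINGLE QUOTATION MARK).
original = "FDA\u2019s"

# UTF-8 stores U+2019 as three bytes: E2 80 99.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with a single-byte charset yields one character per byte.
as_latin1 = utf8_bytes.decode("latin-1")  # 'â' followed by two invisible control characters
as_cp1252 = utf8_bytes.decode("cp1252")   # 'â', '€', '™'

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_bytes))  # b'FDA\xe2\x80\x99s'
print(ascii(as_latin1))   # 'FDA\xe2\x80\x99s'
print(ascii(as_cp1252))   # 'FDA\xe2\u20ac\u2122s'
```

Under ISO-8859-1 the two trailing bytes become non-printing control characters, which would look like "FDAâs"; under Windows-1252 the same bytes would show as "â€™".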

I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:

"Encoding 'SYSTEM' is not supported"

Any idea how to solve this (without having to manually search and replace the unwanted characters, please)?

Many thanks in advance for any kind of input!

Snežana

Marco_Boeck · December 2018

Hi,

This works just fine for me:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
        <parameter key="url" value="https://www.fda.gov/RegulatoryInformation/Guidances/default.htm"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="true"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

kayman · December 2018

Is your process itself also using UTF-8?
When you click into the main process window, you can also define the encoding for the process itself in its parameters. I typically set this to UTF-8 as well, and do the same under settings -> preferences -> general -> encoding.

s_nektarijevic · December 2018

Dear @kayman ,

Many thanks for your suggestion! However, it didn't really help resolve my case :-(

I am not sure whether I am doing things right, but I adjusted the settings as you suggested, reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding setting I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same result.

In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:

After Get Page:

CVM GFI #108 Registering with CVMâ\200\203s Electronic Submission System

In the final Example Set:

Registering with CVMâ€™s Electronic Submission System
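If the encoding cannot be fixed at the source, text garbled in the "â€™" form can usually be repaired by reversing the wrong decoding step. This is a rough Python sketch of that idea, not something built into RapidMiner. `repair_mojibake` is a hypothetical helper name, and it assumes the text was UTF-8 that got read as Windows-1252.

```python
def repair_mojibake(text: str, wrong_charset: str = "cp1252") -> str:
    """Undo a 'UTF-8 read as wrong_charset' mix-up.

    Returns the text unchanged when the round trip is not possible,
    which is the case for most correctly decoded non-ASCII text.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_charset).decode("utf-8")
    except UnicodeError:
        return text


# The garbled string from the Example Set, written with escapes so the
# source file's own encoding does not matter.
garbled = "Registering with CVM\u00e2\u20ac\u2122s Electronic Submission System"
fixed = repair_mojibake(garbled)

print(ascii(fixed))  # 'Registering with CVM\u2019s Electronic Submission System'
```

This only undoes a single wrong decoding. A correctly decoded string that happens to round-trip as valid UTF-8 would be altered too, so it is safer to apply it only to columns known to be affected.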

Any idea what is still wrong?

Many thanks in advance for any kind of input!

Snezana
