UTF-8 encoded text comes out garbled from the Get Page operator
s_nektarijevic
New Altair Community Member
Dear RapidMiners,
I am having an issue with the Get Page operator and UTF-8 encoding.
I am scraping the content of this web page:
According to the HTML code I get out of Get Page, this page uses UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The problem is that, for example, "FDA’s" comes out as "FDAâs".
I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:
"Encoding 'SYSTEM' is not supported"
Any idea how to solve this (ideally without having to manually search and replace the unwanted characters)?
Many thanks in advance for any kind of input!
Snežana
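The "FDA’s" to "FDAâs" symptom described above is what typically happens when UTF-8 bytes are decoded with a single-byte charset such as ISO-8859-1. The following is a minimal sketch in plain Python (outside RapidMiner, for illustration only), assuming that this is the cause here:

```python
# Illustration only: plain Python, not RapidMiner code.
original = "FDA’s"                   # contains U+2019 RIGHT SINGLE QUOTATION MARK
raw = original.encode("utf-8")       # the quote becomes three bytes: E2 80 99

# Decoding those bytes as ISO-8859-1 turns every byte into its own character.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))                 # 'FDAâ\x80\x99s'

# As long as no bytes were lost, the damage can be reversed by undoing the wrong step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)          # True
```

The two trailing characters (U+0080 and U+0099) are control characters and usually invisible, which would explain why only the "â" shows up. If the text passed through a different charset on the way (for example Windows-1252), the round trip would need that charset instead.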
Best Answer
-
Hi,
This works fine for me:
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000-SNAPSHOT">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000-SNAPSHOT" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="45" y="34">
        <parameter key="url" value="https://www.fda.gov/RegulatoryInformation/Guidances/default.htm"/>
        <parameter key="random_user_agent" value="false"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="none"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
        <parameter key="override_encoding" value="true"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
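The settings that matter in this process are override_encoding set to true and encoding set to UTF-8 on the Get Page operator. As a rough sketch of what such an override amounts to, here is a plain Python illustration. The helper name decode_page is hypothetical, and this is not a description of how the operator is implemented:

```python
import re
from typing import Optional


def decode_page(raw: bytes, override: Optional[str] = None) -> str:
    """Decode fetched page bytes (hypothetical helper, not RapidMiner code)."""
    if override is not None:
        # An explicit override wins over anything the page declares.
        return raw.decode(override)
    # Otherwise look for a charset declaration near the top of the document.
    match = re.search(rb'charset=["\']?([\w.-]+)', raw[:2048])
    charset = match.group(1).decode("ascii") if match else "utf-8"
    return raw.decode(charset, errors="replace")


page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    "FDA’s"
).encode("utf-8")

print(decode_page(page, override="UTF-8")[-5:])             # FDA’s
print(repr(decode_page(page, override="iso-8859-1")[-7:]))  # 'FDAâ\x80\x99s'
```

Forcing UTF-8 keeps the apostrophe intact, while forcing a single-byte charset reproduces the garbled output from the question.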
Answers
-
Is your process itself also using UTF-8?
When you click into the main process window, you can also set the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under settings -> preferences -> general -> encoding.
Dear @kayman,
Many thanks for your suggestion! However, it didn't really help resolve my case :-(
I am not sure whether I am doing things right, but I adjusted the settings as you suggested and reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding settings I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same result.
In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:
After Get Page:
CVM GFI #108 Registering with CVMâ\200\203s Electronic Submission System
In the final Example Set:
Registering with CVM’s Electronic Submission System
Any idea what is still wrong?
Many thanks in advance for any kind of input!
Snezana