UTF-8 encoded text doesn't come out right from the Get Page operator

s_nektarijevic New Altair Community Member
edited November 2024 in Community Q&A
Dear RapidMiners,

I am having an issue with the Get Page operator and UTF-8 encoding.

I am scraping the content of this web page:


According to the HTML I get out of Get Page, this page declares UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
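
For reference, the charset declared in a tag like the one above can also be read programmatically. This is a minimal Python sketch, assuming the page source is already available as a string; the function name `declared_charset` is made up for illustration, and the regex is only a rough stand-in for a real HTML parser:

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
_META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in a <meta> tag, lowercased, or None."""
    match = _META_CHARSET.search(html)
    return match.group(1).lower() if match else None


tag = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(declared_charset(tag))  # utf-8
```

The declaration only states what the page claims. Garbling can still occur if the bytes are decoded with a different charset somewhere along the way.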

The problem is that, for example, "FDA’s" does not come out right: the apostrophe is garbled.
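
Garbling of this kind typically appears when UTF-8 bytes are decoded with a single-byte code page. The Python sketch below reproduces the effect under the assumption that the wrong charset is Windows-1252 (`cp1252`); whether that is what happens inside Get Page is not confirmed here, and the helper name `mojibake` is hypothetical:

```python
def mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    # cp1252 leaves a few byte values undefined, so some characters
    # raise UnicodeDecodeError here instead of producing garbage.
    return text.encode("utf-8").decode(wrong_encoding)


original = "FDA\u2019s"          # FDA’s, typographic apostrophe U+2019
garbled = mojibake(original)     # the one apostrophe becomes three characters
assert garbled == "FDA\u00e2\u20ac\u2122s"   # i.e. FDAâ€™s
```

The apostrophe U+2019 is three bytes in UTF-8 (E2 80 99), and each byte is shown as a separate Windows-1252 character.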

I tried enforcing the right encoding by checking the "override encoding" box in the Get Page operator, but if I do that, I get an error message:

"Encoding 'SYSTEM' is not supported"

Any idea how to solve this (without having to manually search and replace the unwanted characters, please!)?
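
If the text has been mangled by exactly one wrong decode, it can often be repaired generically by reversing that step rather than replacing characters one by one. This is a sketch in Python, assuming the wrong charset was Windows-1252 and that the text can be post-processed outside RapidMiner; `fix_mojibake` is an illustrative name, not an existing API:

```python
def fix_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Undo one round of 'UTF-8 bytes decoded as wrong_encoding'.

    Returns the text unchanged when the round trip is not possible,
    which is the usual case for text that was never mangled.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


garbled = "FDA\u00e2\u20ac\u2122s"     # FDAâ€™s
repaired = fix_mojibake(garbled)       # FDA’s
assert repaired == "FDA\u2019s"
```

This only helps when the original bytes survive intact. For messier input, the third-party ftfy library covers many more cases.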

Many thanks in advance for any kind of input!

Snežana



Answers

  • kayman, New Altair Community Member
    edited December 2018
    Is your process itself also using UTF-8?
    If you click into the main process window, you can also set the encoding for the process itself in the parameters. I typically set this to UTF-8 as well, and do the same under Settings -> Preferences -> General -> Encoding.
  • s_nektarijevic, New Altair Community Member
    Dear @kayman,

    Many thanks for your suggestion! However, it didn't really help resolve my case :-(

    I am not sure whether I am doing things right, but I adjusted the settings as you suggested and reran the process, and got the same result as before. I also tried restarting RapidMiner after adjusting the settings, but nothing changed. I am not exactly sure where the problem is, but no matter which encoding setting I choose (I tried SYSTEM, UTF-8 and ISO-8859-1 for fun), I get the same result.

    In any case, what I see straight out of Get Page is different from what I see in the final Example Set. Here is an example:

    After Get Page:
    CVM GFI #108 Registering with CVM’€™€™\200\203€™s Electronic Submission System

    In the final Example Set:
    Registering with CVM’s Electronic Submission System

    Any idea what is still wrong?

    Many thanks in advance for any kind of input!

    Snezana
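
The repeated "€™" fragments in the Get Page output above may mean the text went through more than one round of wrong decoding, though that is a guess from the sample alone. The Python sketch below reverses such layers one at a time, assuming each round was "UTF-8 bytes read as Windows-1252"; `unwrap_mojibake` and `max_rounds` are hypothetical names:

```python
def unwrap_mojibake(text: str, wrong_encoding: str = "cp1252",
                    max_rounds: int = 5) -> str:
    """Reverse repeated 'UTF-8 decoded as wrong_encoding' layers."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode(wrong_encoding).decode("utf-8")
        except UnicodeError:
            break  # no further layer can be reversed cleanly
        if candidate == text:
            break  # plain ASCII or otherwise already stable
        text = candidate
    return text


# Build a doubly mangled sample, then undo both layers.
clean = "CVM\u2019s"                   # CVM’s
mangled = clean
for _ in range(2):
    mangled = mangled.encode("utf-8").decode("cp1252")
assert unwrap_mojibake(mangled) == clean
```

Text where bytes were dropped or replaced along the way will not round-trip, and the function then stops at the last layer it could reverse.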



