Why is UTF-8 not working?

heron_oliveira New Altair Community Member
edited November 2024 in Community Q&A
Today I converted a PDF to txt, and I'm trying to analyse the frequency of some terms in the text. Although the txt is in UTF-8 and I've already set the program's encoding to the default (SYSTEM) and to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, the word should have been 'abrangência' instead of 'abrangãºncia'.

Answers

  • heron_oliveira
    heron_oliveira New Altair Community Member
    My txt file has the correct words; the problem only happens when I run operators in RapidMiner. I'm using operators for tokenizing, Transform Cases, Generate n-Grams, Filter Tokens and Filter Stopwords. But the problem starts at the very first operator, which is Tokenize...
  • heron_oliveira
    heron_oliveira New Altair Community Member
    I would also like to know what format the stop words list must have. Since there is no Portuguese stop words operator, I made a list document, but I don't know whether it accepts a list format or whether it should be a dictionary or something else.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Hi there,
    which operator do you use to read the text file? It should have an encoding setting as well.

    Cheers,
    Martin
  • heron_oliveira
    heron_oliveira New Altair Community Member
    Exactly, I was changing the encoding under Settings > Preferences. But in fact I should have done it in the operator settings. Thanks!
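
For anyone who lands here with the same symptom, the effect can be reproduced outside RapidMiner. The following is only a minimal Python sketch, not what RapidMiner does internally. It assumes the reader decoded the UTF-8 bytes as Windows-1252 (a guess, since the thread does not say which charset was actually applied), and the variable names are made up for illustration. It shows why the encoding has to be set where the file is read: once the bytes are decoded with the wrong charset, every later step (lowercasing, tokenizing, n-grams) works on already-garbled text.

```python
# The word as it is stored in the UTF-8 text file.
text = "abrangência"
raw = text.encode("utf-8")  # 'ê' becomes the two bytes 0xC3 0xAA

# Reading those bytes with the wrong charset (assumed here: Windows-1252)
# turns each byte of 'ê' into a separate character.
wrong = raw.decode("cp1252")

# Reading them with the charset the file was written in restores the word.
right = raw.decode("utf-8")

print(wrong)          # garbled word before any further processing
print(wrong.lower())  # what a case-transform step would then produce
print(right)          # the intended word
```

The garbled form is similar to the one reported in the question, which suggests the text was decoded with a single-byte charset at read time rather than damaged by the later operators.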