Why is UTF-8 not working?
heron_oliveira
New Altair Community Member
Today I converted a PDF to txt, and I'm trying to analyse the frequency of some terms in the text. Although the txt file is in UTF-8, and I've already changed the program's encoding to the default (SYSTEM) or to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, the word should have been 'abrangência' instead of 'abrangãºncia'.
Best Answer
-
Hi there, what operator do you use to read the text file? It should have an encoding setting as well. Cheers, Martin
Answers
-
My txt file has the correct words; the problem only happens when I run operators in RapidMiner. I'm using operators for tokenizing, Transform Cases, Generate n-Grams, Filter Tokens and Filter StopWords. But the problem begins at the first operator, which is Tokenize...
-
I would also like to know what format the stop word list must have. Since there is no Portuguese stop words operator, I made a list document, but I don't know whether it accepts a list format or whether it should be a dictionary or something else.
-
Hi there, what operator do you use to read the text file? It should have an encoding setting as well. Cheers, Martin
-
Exactly, I was changing the encoding in Settings > Preferences. But in fact I should have done it in the operator settings. Thanks!
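
For anyone who finds this thread later: the symptom in the question is typical of UTF-8 bytes being decoded with a single-byte encoding at the moment the file is read. The sketch below is plain Python, not RapidMiner, and it assumes the wrong encoding was Windows-1252 (cp1252); the actual system default on the original poster's machine is not known. It only illustrates why the encoding has to be set where the file is read, not in a global preference.

```python
import os
import tempfile

word = "abrangência"

# Write the word to a temporary file as UTF-8, like the converted txt file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as f:
    f.write(word)
    path = f.name

# Reading with a single-byte encoding (assumed here: cp1252) splits the
# two UTF-8 bytes of "ê" (0xC3 0xAA) into two separate characters.
with open(path, encoding="cp1252") as f:
    wrong = f.read()

# Reading with the encoding the file was written in recovers the word.
with open(path, encoding="utf-8") as f:
    right = f.read()

os.remove(path)

print(wrong)          # abrangÃªncia
print(wrong.lower())  # abrangãªncia (after a lower-casing step)
print(right)          # abrangência
```

Lower-casing the misread text gives a form close to the one reported in the question, which suggests the damage happens at read time and every later operator just passes it along.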