Why is UTF-8 not working?
Today I converted a PDF to txt, and I'm trying to analyse the frequency of some terms in the text. Although the txt file is in UTF-8 and I've already set the program's encoding to the default (SYSTEM) or to 'UTF-8' before tokenizing and generating n-grams, it keeps showing incorrect words. For example, a word that should be 'abrangência' comes out as 'abrangãºncia'.

My txt file has the correct words; the problem only happens when I run operators in RapidMiner. I'm using operators for tokenizing, Transform Cases, Generate n-Grams, Filter Tokens and Filter StopWords, but the problem already appears at the first operator, which is Tokenize...
Hi there,
What operator do you use to read the text file? It should have an encoding setting as well.
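To illustrate why the reading step matters, here is a small sketch in Python (not RapidMiner) of what probably happens when UTF-8 bytes are decoded with a single-byte encoding. The choice of `cp1252` is only an assumption about what the system default might be on your machine; the variable names are made up for the example.

```python
# The word as it appears in the correctly encoded txt file.
word = "abrangência"

# The bytes stored on disk when the file is saved as UTF-8.
# "ê" takes two bytes in UTF-8 (0xC3 0xAA).
utf8_bytes = word.encode("utf-8")

# Assumption: the reader decodes with a single-byte encoding such as cp1252.
# Each of the two bytes of "ê" then becomes its own character.
garbled = utf8_bytes.decode("cp1252")

# Decoding with the encoding the file was actually written in.
correct = utf8_bytes.decode("utf-8")

print(garbled)          # abrangÃªncia
print(garbled.lower())  # abrangãªncia (what a later lower-casing step would show)
print(correct)          # abrangência
```

If that is what is going on, the text is already damaged before Tokenize sees it, so changing the encoding in later operators cannot repair it. The encoding has to match at the point where the file is read.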
Cheers,
Martin