Replacing special characters when text mining
Hi
I am trying to replace special characters in a dataset but am not having much luck. For example:
André Shoémaker
Adrié Spéllman
It seems as though RapidMiner is unable to understand those characters and is simply marking them as �. None of the replace operators seems to fix this issue.
Should I be doing this inside a Generate Attributes operator, or is there a better way?
Hi @robin, yep, this looks like an encoding issue. Try changing the encoding setting when you import the data.
Unfortunately there are a pile of options there (way too many IMHO)...my go-to ones that I try when I get that problem are ISO-8859-1, UTF-8, or windows-1250. If none of those work, I move on to ASCII (yuck) and so on. Sometimes you can look at the original data source and find out the encoding from there...but not always.
Scott
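
To make the trial-and-error above concrete outside RapidMiner, here is a minimal Python sketch that decodes the same raw bytes with each candidate encoding so you can compare the results by eye. The function name preview_decodings and the sample bytes are assumptions for illustration; they are not part of RapidMiner.

```python
# Candidate encodings from the post above. UTF-8 rejects byte sequences that
# are not valid UTF-8, while ISO-8859-1 accepts any byte, so a failure under
# UTF-8 is informative and a success under ISO-8859-1 proves little.
CANDIDATES = ["utf-8", "iso-8859-1", "windows-1250"]


def preview_decodings(raw, candidates=CANDIDATES):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    previews = {}
    for encoding in candidates:
        try:
            previews[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            previews[encoding] = None
    return previews


# Hypothetical sample: one name from the question, stored as ISO-8859-1 bytes.
sample = "André Shoémaker".encode("iso-8859-1")

# previews["utf-8"] is None here, because a lone 0xE9 byte is not valid UTF-8.
previews = preview_decodings(sample)

# Decoding with replacement instead of failing reproduces the symptom from the
# question: every undecodable byte becomes U+FFFD, the replacement character.
symptom = sample.decode("utf-8", errors="replace")
```

A candidate that decodes without error can still be the wrong one, so the decoded text needs a visual check against names you know.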


Hi @robin,
If you are on a UNIX-like system (Linux/macOS), you may be able to run the following command: file -I filename.csv (on Linux the flag is lowercase: file -i filename.csv).
The charset= field of the output shows the encoding to use. Set that in RapidMiner and it will save you the hassle of converting characters. At least in my case (Chilean Spanish, Portuguese, Swiss German and sometimes struggling with EBCDIC because that's what the sensors I read at work send), this has been quite helpful.
All the best,
Rodrigo.
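
If you want to script that check, a rough Python sketch could pull the charset= value out of a line of file output and confirm it names a codec Python knows. The helper names parse_charset and is_usable_encoding and the sample output line are hypothetical; the real output depends on your file. Keep in mind that file guesses from the bytes, so treat its answer as a hint.

```python
import codecs
import re


def parse_charset(file_output):
    """Extract the charset= value from one line of `file` MIME output, or None."""
    match = re.search(r"charset=([\w.-]+)", file_output)
    return match.group(1).lower() if match else None


def is_usable_encoding(charset):
    """True if Python has a codec by this name.

    `file` can also report values such as `binary` or `unknown-8bit`,
    which are not encodings you can select.
    """
    if charset is None:
        return False
    try:
        codecs.lookup(charset)
    except LookupError:
        return False
    return True


# Hypothetical output line for illustration only.
line = "filename.csv: text/plain; charset=iso-8859-1"
charset = parse_charset(line)
usable = is_usable_encoding(charset)
```

When the reported value is not a usable encoding, falling back to trying a few candidates by hand is probably the next step.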
Where is your dataset coming from? Operators like Read CSV that import external data have an "encoding" parameter. If you set the encoding correctly for your data import, you'll see all special characters.
Regards,
Balázs