"Filter Stopwords (Dictionary) -- Unicode support?"
pleonard
New Altair Community Member
Hi there, I'm having good luck with the Filter Stopwords (Dictionary) operator in creating a stoplist for Danish, but am finding that stopwords containing a-ring (å) are not filtered. I've confirmed the file is in UTF-8, as are my source texts, and that the linefeeds are correct. Other stopwords, which do not include non-ASCII characters, are being filtered correctly. Has anyone come across this before?
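For anyone wanting to repeat the file checks described above, here is a minimal sketch in Python. It is not part of RapidMiner; the function name `check_utf8_stoplist` and the sample words are made up for illustration. It only reports whether the raw bytes decode as UTF-8, whether a byte-order mark is present, and whether Windows line endings appear.

```python
import codecs


def check_utf8_stoplist(raw: bytes) -> dict:
    """Report basic encoding facts about the raw bytes of a stoplist file."""
    report = {
        "has_bom": raw.startswith(codecs.BOM_UTF8),  # a BOM can glue itself to the first word
        "has_crlf": b"\r\n" in raw,                  # Windows-style line endings
        "valid_utf8": True,
    }
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
    return report


# Hypothetical sample data standing in for the bytes of a real stoplist file.
sample = "på\nog\n".encode("utf-8")
report = check_utf8_stoplist(sample)
print(report)
```

With a real file, the bytes would come from something like `open(path, "rb").read()`, where `path` is wherever the stoplist lives.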
Answers
-
Hi,
No, I haven't, but the number of Danish texts I process is close to zero. Did you make sure that RapidMiner opens the text file in the UTF-8 encoding?
Anyway: If you have a good stopword file for Danish, would you like to contribute it? We could include it in the core...
Greetings,
Sebastian
-
OK, I've confirmed this is a bug, I think. Let's move to German because that is a more common language:
Set these two things:
1) rapidminer.general.encoding to UTF-8
2) Process Documents from Files to UTF-8
Ensure both your text and stoplist are in UTF-8.
Text: schloß means castle.
Stoplist: schloß castle
Result: schloß means
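The result above is what one would expect if the stoplist bytes were decoded with a single-byte charset instead of UTF-8. That is only an assumption about the cause; the thread does not confirm which charset RapidMiner actually used. A minimal Python sketch of the effect, with Latin-1 standing in for the wrong charset and a hypothetical `filter_stopwords` helper:

```python
def filter_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in the stopword set."""
    tokens = text.rstrip(".").split()
    return " ".join(t for t in tokens if t not in stopwords)


def parse_stoplist(raw, encoding):
    """Decode the stoplist bytes with the given charset, one word per line."""
    return {word for word in raw.decode(encoding).split("\n") if word}


text = "schloß means castle."
stoplist_bytes = "schloß\ncastle\n".encode("utf-8")

# Assumed failure mode: UTF-8 bytes read as Latin-1, so "ß" turns into two other characters.
wrong = parse_stoplist(stoplist_bytes, "latin-1")
# Intended behaviour: the bytes are read as UTF-8.
right = parse_stoplist(stoplist_bytes, "utf-8")

result_wrong = filter_stopwords(text, wrong)
result_right = filter_stopwords(text, right)
print(result_wrong)  # schloß means
print(result_right)  # means
```

Pure-ASCII entries such as "castle" have the same bytes in both charsets, which would explain why only the words with non-ASCII characters slip through.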
This is with RapidMiner 5.1.001 on Mac OS X 10.6. Surely there must be people from Germany working with this who have noticed this problem before -- or is there a trick to get around it?
Thanks!
-
Hi,
I have added a parameter for choosing the encoding of the dictionary. This will be made available with the next Text Extension release, but it's uncertain when that will be.
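To illustrate the idea of an explicit dictionary encoding, here is a rough Python sketch. It is not the Text Extension's code, and `load_stopwords` with its `encoding` argument is a hypothetical stand-in for the new parameter.

```python
import io


def load_stopwords(stream, encoding="utf-8"):
    """Read one stopword per line from a binary stream, decoding with the given charset."""
    reader = io.TextIOWrapper(stream, encoding=encoding)
    return {line.strip() for line in reader if line.strip()}


# In-memory stand-in for a dictionary file saved as UTF-8.
dictionary = io.BytesIO("schloß\ncastle\n".encode("utf-8"))
stopwords = load_stopwords(dictionary, encoding="utf-8")
print("schloß" in stopwords)  # True
```

The point of passing the charset explicitly is that the result no longer depends on the platform's default encoding.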
Greetings,
Sebastian
-
Thanks! If you have any need of a beta-tester (I work with large Swedish, Danish and Norwegian texts) please let me know and I'd be glad to help out...
-
Hi,
We are currently working on a completely new Text Extension that will go beyond everything the old one was able to do. We will document our progress in our Special Interest Group for Text Mining. If you want to participate, you are very welcome. I just need your email in a PM to put you on the list.
Greetings,
Sebastian