"Filter Stopwords (Dictionary) -- Unicode support?"
pleonard
Hi there, I'm having good luck with the Filter Stopwords (Dictionary) in creating a stoplist for Danish, but am finding that a-ring (å) is not obeyed. I've confirmed the file is in utf-8, as are my source texts, and that the linefeeds are correct. Other stopwords, that do not include non-ascii, are being filtered correctly. Anyone come across this before?
land
Hi,
No, I haven't, but the number of Danish texts I process is close to zero.
Did you make sure that RapidMiner opens the text file in the UTF-8 encoding?
Anyway: if you have a good stopword file for Danish, would you like to contribute it? We could include it in the core...
Greetings,
Sebastian
pleonard
OK, I've confirmed this is a bug, I think. Let's move to German because that is a more common language:
Set these two things:
1) rapidminer.general.encoding to UTF-8
2) Process Documents from Files to UTF-8
Ensure both your text and stoplist are in UTF-8.
Text: schloß means castle.
Stoplist: schloß castle
Result: schloß means
This is with RapidMiner 5.1.001 on MacOS X 10.6. Surely there must be people from Germany working with this who have noticed this problem before -- or a trick to get around it?
Thanks!
land
Hi,
I have added a parameter for choosing the encoding of the dictionary. This will be made available with the next TextExtension release, but it's uncertain when that will be.
Greetings,
Sebastian
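The symptom reported in this thread is consistent with the stoplist file being decoded with a charset other than UTF-8. Here is a minimal Python sketch of that effect, outside RapidMiner. It assumes the dictionary was read with a platform default such as MacRoman, which the thread does not confirm. `load_stopwords` and `filter_stopwords` are hypothetical helpers for illustration, not RapidMiner APIs.

```python
import re

# The stoplist as stored on disk: UTF-8 bytes, whitespace-separated.
stoplist_bytes = "schloß castle".encode("utf-8")
text = "schloß means castle."


def load_stopwords(raw, encoding):
    """Decode the raw dictionary bytes with an explicit encoding."""
    return set(raw.decode(encoding).split())


def filter_stopwords(text, stopwords):
    """Tokenize on word characters and drop tokens found in the stoplist."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in stopwords]


# Assumption: the dictionary is decoded with a non-UTF-8 platform default.
wrong = load_stopwords(stoplist_bytes, "mac_roman")
# Decoding with the encoding the file was actually written in.
right = load_stopwords(stoplist_bytes, "utf-8")

print(filter_stopwords(text, wrong))  # 'schloß' is not filtered
print(filter_stopwords(text, right))  # 'schloß' is filtered
```

With the wrong charset, the two UTF-8 bytes of "ß" decode to two unrelated characters, so the stoplist entry never equals the token from the text. Pure-ASCII entries such as "castle" still match. This is why an explicit encoding setting for the dictionary would address the problem.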
pleonard
Thanks! If you have any need of a beta-tester (I work with large Swedish, Danish and Norwegian texts) please let me know and I'd be glad to help out...
land
Hi,
We are currently working on a completely new Text Extension that will go beyond everything the old one was able to do. We will document our progress in our Special Interest Group for Text Mining. If you want to participate, you are very welcome. I just need your email in a PM to put you on the list.
Greetings,
Sebastian