"Filter Stopwords (Dictionary) -- Unicode support?"

pleonard New Altair Community Member
edited November 5 in Community Q&A
Hi there, I'm having good luck using Filter Stopwords (Dictionary) with a stoplist I created for Danish, but stopwords containing a-ring (å) are not being filtered. I've confirmed that the stoplist file and my source texts are both in UTF-8 and that the line endings are correct. Other stopwords, which contain only ASCII characters, are filtered correctly. Has anyone come across this before?
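
One way to double-check the "file is in UTF-8" claim independently of RapidMiner is a small script like the one below. This is only a sketch using the Python standard library; the function name `inspect_stoplist` is made up for illustration and nothing here comes from RapidMiner itself. It reports whether the bytes decode as UTF-8, whether the file starts with a byte order mark, and whether any word is stored in decomposed form (for example "a" followed by a combining ring instead of a single "å"), since any of these could plausibly make a dictionary entry fail to match.

```python
import codecs
import unicodedata


def inspect_stoplist(raw: bytes) -> dict:
    """Report encoding properties of a stoplist given its raw bytes.

    Hypothetical helper for illustration; not part of RapidMiner.
    """
    report = {
        "has_bom": raw.startswith(codecs.BOM_UTF8),
        "valid_utf8": True,
        "non_nfc_words": [],
    }
    try:
        # "utf-8-sig" decodes UTF-8 and drops a leading BOM if there is one.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        report["valid_utf8"] = False
        return report
    for word in text.split():
        # A word that changes under NFC normalization is stored decomposed.
        if unicodedata.normalize("NFC", word) != word:
            report["non_nfc_words"].append(word)
    return report


if __name__ == "__main__":
    # "på" once with a precomposed å (U+00E5) and once as a + U+030A.
    sample = "og p\u00e5 pa\u030a".encode("utf-8")
    print(inspect_stoplist(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`, where `path` is wherever the stoplist is stored.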

Answers

  • land
    land New Altair Community Member
    Hi,

    No, I haven't, but the number of Danish texts I process is close to zero :) Did you make sure that RapidMiner opens the text file with UTF-8 encoding?

    Anyway, if you have a good stopword file for Danish, would you like to contribute it? We could include it in the core...

    Greetings,
      Sebastian
  • pleonard
    pleonard New Altair Community Member
    OK, I've confirmed this is a bug, I think. Let's move to German because that is a more common language:

    Set these two things:

    1) rapidminer.general.encoding to UTF-8
    2) the encoding of Process Documents from Files to UTF-8

    Ensure both your text and stoplist are in UTF-8.

    Text: schloß means castle.
    Stoplist: schloß castle

    Result: schloß means (expected: means, since schloß is in the stoplist too)

    This is with RapidMiner 5.1.001 on Mac OS X 10.6. Surely there must be people from Germany working with this who have noticed this problem before, or who know a trick to get around it?

    Thanks!
  • land
    land New Altair Community Member
    Hi,
    I have added a parameter for choosing the encoding of the dictionary. It will be available in the next Text Extension release, but it is not yet certain when that will be.

    Greetings,
      Sebastian
  • pleonard
    pleonard New Altair Community Member
    Thanks! If you need a beta tester (I work with large Swedish, Danish and Norwegian texts), please let me know and I'd be glad to help out...
  • land
    land New Altair Community Member
    Hi,
    We are currently working on a completely new Text Extension that will go beyond everything the old one was able to do. We will document our progress in our Special Interest Group for Text Mining. If you want to participate, you are very welcome. Just send me your email address in a PM and I will put you on the list.

    Greetings,
      Sebastian
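
For readers who want to see why a missing encoding setting on the dictionary would produce exactly the "schloß means" result reported above, here is a minimal sketch in plain Python. It does not use or imitate RapidMiner's code. The helper names `load_stoplist` and `filter_stopwords` are hypothetical, and the choice of Mac Roman as the wrong charset is an assumption (a plausible legacy default on Mac OS X), not something confirmed in the thread. Tokens are split on whitespace only, so the sample text omits the trailing period.

```python
def load_stoplist(raw: bytes, encoding: str) -> set:
    """Decode raw dictionary bytes with the given charset and split into words."""
    return set(raw.decode(encoding).split())


def filter_stopwords(text: str, stopwords: set) -> list:
    """Keep only the whitespace-separated tokens that are not stopwords."""
    return [tok for tok in text.split() if tok not in stopwords]


# The stoplist as it sits on disk: UTF-8 bytes for "schloß castle".
stoplist_bytes = "schlo\u00df castle".encode("utf-8")
text = "schlo\u00df means castle"

# Assumption: the dictionary is read with a single-byte legacy charset.
# The two UTF-8 bytes of "ß" turn into two unrelated characters, so the
# entry no longer equals "schloß"; the pure-ASCII "castle" is unaffected.
wrong = load_stoplist(stoplist_bytes, "mac_roman")

# Reading the same bytes as UTF-8 restores the intended entry.
right = load_stoplist(stoplist_bytes, "utf-8")

print(filter_stopwords(text, wrong))  # ['schloß', 'means']
print(filter_stopwords(text, right))  # ['means']
```

If this is indeed the mechanism, the only workaround without a dictionary encoding parameter would be to save the stoplist in whatever charset the operator actually uses to read it, which is fragile; an explicit encoding setting is the cleaner fix.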