Using Hindi Language in Tokenizer

alabiit New Altair Community Member
edited November 5 in Community Q&A
Hi,

My documents to be analyzed are in Hindi, encoded as UTF-8. To create the word vector I used the WVplugin. The problem is that I am not getting all the tokens (I tried every tokenizer in RapidMiner 4.6); in fact, I am getting far too few: 4, to be precise.

I changed the content language and encoding to Hindi and UTF-8, but without any success. Is there any additional setup needed to tokenize the text properly?

~alabiit

Answers

  • land
    land New Altair Community Member
    Hi,
    From RapidMiner 5.0 on, you can configure the tokenizer in more detail. You can enter arbitrary split characters, so it should work with any language that separates its words with some character.

    Greetings,
      Sebastian
  • alabiit
    alabiit New Altair Community Member
    Hi Sebastian,

    Thanks. But upgrading is always a tough job :'(.

    Will check it out.
    ~alabiit
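
For readers who want to see the split-character idea from Sebastian's answer outside RapidMiner, here is a minimal Python sketch. It is not RapidMiner's tokenizer or its API; the function name `tokenize`, the `SPLIT_CHARS` set, and the sample sentence are all assumptions made for illustration. The point is only that splitting on an explicit set of separator characters (here including the Devanagari danda) keeps Hindi words intact, whereas a tokenizer that only recognizes Latin letters would discard them.

```python
# Assumed separator set: whitespace, common ASCII punctuation, and the
# Devanagari danda (U+0964) and double danda (U+0965). Adjust as needed.
SPLIT_CHARS = " \t\n\r.,;:!?()\"'\u0964\u0965"


def tokenize(text, split_chars=SPLIT_CHARS):
    """Split text on any of the given characters and drop empty tokens."""
    # Map every split character to a plain space, then split on whitespace.
    table = {ord(ch): " " for ch in split_chars}
    return text.translate(table).split()


# Hypothetical sample sentence; in practice the text would be read from a
# file opened with encoding="utf-8".
sample = "मेरा नाम राम है। मैं भारत में रहता हूँ।"
tokens = tokenize(sample)
print(len(tokens), tokens)
```

Because vowel signs and other combining marks are not in the split set, they stay attached to their words; only the listed separators end a token.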