Using Hindi Language in Tokenizer
alabiit
New Altair Community Member
Hi,
My documents to be analyzed are in Hindi, encoded as UTF-8. To create the word vector I used the WVplugin. The problem is that I am not getting all the tokens (I tried all the tokenizers in RapidMiner 4.6); in fact I am getting far too few, 4 to be precise.
I changed the content language and encoding to Hindi and UTF-8, but without any success. Is there any additional setup needed to tokenize the text properly?
~alabiit
Answers
Hi,
From RapidMiner 5.0 on, you can configure the tokenizer in more detail. You can enter arbitrary split characters, so it should work with any language that separates its words with a character at all.
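To illustrate the idea of splitting on a configurable set of characters, here is a minimal Python sketch. It is not RapidMiner code and does not reflect how its tokenizer is implemented; the `tokenize` function, the `SPLIT_CHARS` set, and the sample sentence are all hypothetical and only show that Hindi text splits cleanly once whitespace and the Devanagari danda are treated as separators.

```python
import re

# Hypothetical set of split characters: whitespace, the Devanagari danda (।)
# and double danda (॥), plus some common ASCII punctuation.
SPLIT_CHARS = " \t\n\r।॥.,;:!?()\"'-"


def tokenize(text, split_chars=SPLIT_CHARS):
    """Split text on any run of the given characters and drop empty tokens."""
    # re.escape makes every split character safe to use inside a character class.
    pattern = "[" + re.escape(split_chars) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


# Made-up sample sentence, assumed to be decoded from UTF-8 already.
sample = "मेरा नाम राम है। मैं दिल्ली में रहता हूँ।"
tokens = tokenize(sample)
print(len(tokens), tokens)
```

If the documents are read from files, they would need to be opened with `encoding="utf-8"` first, otherwise the Devanagari characters may be corrupted before the split ever runs.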
Greetings,
Sebastian
Hi Sebastian,
Thanks. But upgrading is always a tough job.
Will check it out.
~alabiit