Text processing
chanokpy
New Altair Community Member
I'm wondering whether RapidMiner can tokenize Thai sentences into words. If not, how can I filter out Thai characters?
Thank you in advance!
Dtip
Best Answer
Not out of the box. However, it is feasible to use Python as an external tokenizer and then continue your workflow in RapidMiner.
I've had similar issues with Japanese, which also has no spaces between words, and taking an external side step to do the tokenization and then going back into RM worked fine. There are quite a few tokenizers available for Python that cover these 'non-spaced' languages, so it's a matter of finding the one that best suits your needs.
An alternative is to use a dictionary and have RM match the longest possible word at each position. This worked reasonably well for Japanese in some of our scenarios where the variety of words was fairly limited, but using Python turned out to be more reliable and (much) faster. I also don't know any Thai, so it might be more complex than for Japanese or Chinese, where this max match is an option.
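To make the dictionary idea a bit more concrete, here is a minimal Python sketch of greedy longest-match ("max match") segmentation, plus a small helper for the other half of the question, filtering out Thai characters. It is only an illustration under a few assumptions: the function names `max_match` and `strip_thai` are made up for this example, the word list is a tiny hand-picked sample rather than a real Thai dictionary, and "Thai character" is taken to mean anything in the Thai Unicode block (U+0E00 to U+0E7F). A dedicated Thai tokenizer library would most likely do better on real text.

```python
import re
from typing import List, Set

# Assumption: "Thai character" means the Thai Unicode block, U+0E00 to U+0E7F.
THAI_CHARS = re.compile(r"[\u0E00-\u0E7F]+")


def strip_thai(text: str) -> str:
    """Remove every character that falls in the Thai Unicode block."""
    return THAI_CHARS.sub("", text)


def max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-match segmentation against a word list.

    At each position the longest dictionary word that fits is taken.
    A character that starts no known word becomes a one-character token.
    """
    longest = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Illustrative sample word list, not a real dictionary.
    words = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}
    print(max_match("ฉันรักภาษาไทย", words))
    print(strip_thai("RapidMiner ภาษาไทย 2024"))
```

Because the match is greedy, it can split ambiguous text wrongly when a long dictionary word swallows the start of the next one, which may be part of why the dictionary approach only held up when the variety of words was limited.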