Text processing
chanokpy
I'm wondering if RapidMiner can tokenize Thai sentences into words. If not, how can I filter out Thai characters?
Thank you in advance!
Accepted answers
kayman
Not out of the box. It is, however, feasible to use Python as an external tokenizer and then continue your workflow in RapidMiner.
I've had similar issues with Japanese, which also has no spaces between words. Using an external side step to do the tokenization and then going back into RM worked fine. There are quite a few tokenizers available for Python that cover these 'non-spaced' languages, so it's about finding the one that best suits your needs.
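For the second part of the question (filtering out Thai characters), a Python side step does not need a tokenizer at all. The sketch below is only an illustration: it assumes that "Thai character" means anything in the Unicode Thai block (U+0E00 to U+0E7F), and the function names `strip_thai` and `keep_thai` are made up for this example, not part of RapidMiner or any library.

```python
import re

# Assumption: "Thai character" = any code point in the Unicode Thai block.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")


def strip_thai(text):
    """Remove Thai characters and collapse the whitespace left behind."""
    without_thai = THAI_RUN.sub(" ", text)
    return " ".join(without_thai.split())


def keep_thai(text):
    """Return the runs of consecutive Thai characters, in order."""
    return THAI_RUN.findall(text)


if __name__ == "__main__":
    sample = "RapidMiner สวัสดี 2024"
    print(strip_thai(sample))
    print(keep_thai(sample))
```

Note that `keep_thai` only isolates unbroken runs of Thai script; it does not split those runs into words, which is still the job of a real tokenizer.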
An alternative is to use a dictionary and have RM match the longest possible word at each position. This worked reasonably well for Japanese in some of our scenarios where the variety of words was fairly limited, but using Python turned out to be more reliable and (much) faster. I also don't know any Thai, so it might be more complex than for Japanese or Chinese, where this max match is an option.
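To make the max match idea concrete, here is a minimal sketch of greedy longest-match segmentation in plain Python. It is not the process we ran in RM, just an illustration of the technique; the function name `max_match` and the tiny word list are hypothetical, and a real dictionary would be far larger.

```python
def max_match(text, dictionary):
    """Greedy longest-match segmentation, scanning left to right."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


if __name__ == "__main__":
    # Hypothetical mini dictionary, for illustration only.
    words = {"東京", "東京都", "京都", "に", "住む"}
    print(max_match("東京都に住む", words))
```

Because the match is greedy, it can pick a long word that leaves an awkward remainder (with the words "the", "cat", "cats", "sat" and "at", the input "thecatsat" comes out as "the", "cats", "at"), which is one reason a proper tokenizer tends to be more reliable.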