Text processing

chanokpy New Altair Community Member
edited November 5 in Community Q&A
I'm wondering whether RapidMiner can tokenize Thai sentences into words. If not, how can I filter out Thai characters?

Thank you in advance!
Dtip

Best Answer

  • kayman
    kayman New Altair Community Member
    Answer ✓
    Not out of the box. It is, however, feasible to use Python as an external tokenizer and then continue your workflow in RapidMiner.

    I've had similar issues with Japanese, which also has no spaces between words; taking an external side step to do the tokenization and then going back into RM worked fine. There are quite a few tokenizers available for Python that cover these 'non-spaced' languages, so it's a matter of finding the one that best suits your needs.
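
    As a rough sketch of that side step (not any particular library's API): the Python part only has to take the raw text, run it through whichever tokenizer you pick, and hand back space-separated tokens, so that ordinary whitespace tokenization can take over afterwards. The names `segment_texts` and `char_tokenizer` are made up for illustration, and the character-level splitter is only a stand-in for a real Thai or Japanese tokenizer.

```python
from typing import Callable, Iterable, List


def char_tokenizer(text: str) -> List[str]:
    """Placeholder tokenizer: one token per non-space character.

    Assumption: replace this with a real tokenizer for your language.
    """
    return [ch for ch in text if not ch.isspace()]


def segment_texts(
    texts: Iterable[str],
    tokenizer: Callable[[str], List[str]] = char_tokenizer,
) -> List[str]:
    """Tokenize each text and re-join the tokens with single spaces."""
    return [" ".join(tokenizer(text)) for text in texts]
```

    Passing the tokenizer in as a function keeps the rest of the step unchanged when you try a different one.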

    An alternative is to use a dictionary and have RM match the longest possible word at each position. This worked reasonably well for Japanese in some of our scenarios where the variety of words was fairly limited, but using Python turned out to be more reliable and (much) faster. I also don't know any Thai, so it might be more complex than Japanese or Chinese, where this max match is an option.
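
    To make the longest-match idea concrete, here is a minimal sketch in Python, assuming a plain word list as the dictionary. The function name `max_match` and the fallback of emitting unknown characters one at a time are my own choices for illustration, not something RM provides.

```python
from typing import Iterable, List


def max_match(text: str, dictionary: Iterable[str]) -> List[str]:
    """Greedy longest-match segmentation against a word list.

    Characters that start no dictionary word become single-character tokens.
    """
    words = {w for w in dictionary if w}  # ignore empty entries
    longest = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                tokens.append(candidate)
                i += size
                break
    return tokens


# English without spaces, used here only to show the behaviour.
print(max_match("thecatsat", ["the", "cat", "cats", "at", "sat"]))
# prints ['the', 'cats', 'at']
```

    The example also shows the weak spot: the greedy match takes "cats" and leaves "at", where a reader would want "cat" and "sat". That kind of error is why this only holds up when the vocabulary is fairly limited.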
