[SOLVED] The approach for filtering non-letter tokens

Unknown · October 2012

In RapidMiner, I use the Tokenize operator to process a lot of documents. Some of my documents contain many non-letter characters, such as digits, %, $, and other symbols. Is there an operator that lets me filter out these tokens? Thanks.

MariusHelf · October 2012

Hi,

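First of all, you have to configure the Tokenize operator to use a splitting pattern appropriate to your problem. By default, it splits at "non-letters", which means every digit or symbol is treated as a separator. You could change it to split at, e.g., whitespace characters instead, so that tokens containing digits or symbols stay intact and can be filtered afterwards.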

Then, to filter, you can use the Filter Tokens operator with a customized pattern.
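To make the two steps concrete, here is a minimal sketch of the same logic in plain Python, outside RapidMiner. It is only an illustration: the function names and the letters-only pattern `[A-Za-z]+` are assumptions made for this example, not RapidMiner settings, and the pattern you need depends on which tokens you want to keep.

```python
import re

# Hypothetical pattern for this sketch: a token made of ASCII letters only.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def tokenize_on_whitespace(text):
    """Split on runs of whitespace so digits and symbols stay inside tokens."""
    return text.split()


def keep_letter_tokens(tokens):
    """Keep only tokens that consist entirely of letters."""
    return [t for t in tokens if LETTERS_ONLY.fullmatch(t)]


tokens = tokenize_on_whitespace("Sales rose 15% to $200 in Q3 overall")
print(keep_letter_tokens(tokens))
```

Because the whole token has to match, anything containing a digit or symbol (such as "15%", "$200" or "Q3") is dropped rather than being cut into pieces.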

If you have problems with the regular expressions, please post again.

Happy Mining!
~Marius

Unknown · October 2012

Marius, thanks.
