[SOLVED] The approach for filtering non-letter tokens

Unknown
edited November 5 in Community Q&A
In RapidMiner, I use the Tokenize operator to process a lot of documents. Some of these documents contain many non-letter characters, such as digits, %, $, and other symbols. Is there an operator that lets me filter out tokens like these? Thanks.

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    First of all, you have to configure the Tokenize operator to use a splitting pattern appropriate to your problem. By default it splits at non-letters; you could change it, for example, to split at whitespace characters instead.

    Then, to filter, you can use the Filter Tokens operator with a customized pattern.

    If you have problems with the regular expressions, please post again.

    Happy Mining!
    ~Marius
  • Marius, Thanks.
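
For anyone who wants to try out a pattern before putting it into the operators, the two steps from the answer (split at whitespace, then filter tokens by a pattern) can be sketched in plain Python. This is only an illustration of the logic, not how RapidMiner implements it: the function names `tokenize` and `filter_tokens` are made up for this example, the sample sentence is invented, and the pattern uses Python's `re` syntax, which may differ in detail from what the operators accept. The default pattern `[A-Za-z]+` keeps only tokens made entirely of ASCII letters, so it would also drop words with accented characters.

```python
import re


def tokenize(text):
    # Split at runs of whitespace instead of at non-letters,
    # so tokens such as "15%" or "$200" stay intact for the filter step.
    return [token for token in re.split(r"\s+", text) if token]


def filter_tokens(tokens, pattern=r"[A-Za-z]+"):
    # Keep a token only if the whole token matches the pattern.
    # The default pattern accepts ASCII letters only (an assumption
    # made for this sketch; adjust it for other alphabets).
    compiled = re.compile(pattern)
    return [token for token in tokens if compiled.fullmatch(token)]


tokens = tokenize("Revenue grew 15% to $200 in Q3 overall")
letters_only = filter_tokens(tokens)
print(letters_only)
```

Matching the whole token is what removes mixed tokens like "Q3"; a pattern that only has to match somewhere inside the token would let them through.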