I'm using the StringTokenizer operator from the Text plugin (v4.5) to create tokens from text. My text documents contain alphanumeric codes that I would like to keep as tokens, but the StringTokenizer appears to tokenize only alphabetic words and to strip out all digits and special characters.
For example, a document may contain three alphanumeric codes (two of them distinct), as shown below. Running the StringTokenizer on it yields a single token, "C", because all digits and special characters are removed. What I would like instead is for the StringTokenizer to produce two tokens ("C847_0" and "C372_-1") for this document.
C847_0 C372_-1 C847_0
Is there an option in the StringTokenizer that allows alphanumeric tokens? Alternatively, is there an operator that creates attributes simply by splitting on spaces, so that I could then filter out whatever types of attributes I don't need (e.g., purely numeric tokens)?
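To make the behavior I'm after concrete, here is a minimal sketch in plain Python. It is not the Text plugin's API; the function name `whitespace_tokens` and the `drop_numeric` flag are made up purely for illustration. It splits on whitespace, keeps each distinct token once, and optionally drops purely numeric tokens.

```python
import re

# Hypothetical helper, not part of the Text plugin: illustrates the desired result only.
NUMERIC = re.compile(r"[-+]?\d+(\.\d+)?")

def whitespace_tokens(text, drop_numeric=True):
    """Split on whitespace and return the distinct tokens in first-seen order."""
    tokens = []
    for token in text.split():
        # Skip tokens that consist only of a number (e.g. "42", "-7", "3.5").
        if drop_numeric and NUMERIC.fullmatch(token):
            continue
        # Keep each distinct token once, like one attribute per token.
        if token not in tokens:
            tokens.append(token)
    return tokens

print(whitespace_tokens("C847_0 C372_-1 C847_0"))
# prints ['C847_0', 'C372_-1']
```

For the example document above this gives exactly the two tokens I would like to end up with as attributes.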
Any help would be appreciated.
Thanks.