Define splitting characters for the tokenizer?
Hi!
I was playing around with the text plugin because it seemed to be the easiest way to try running SVMs on the data I am working with, and the example already seems quite useful. However, the StringTokenizer does too much splitting for my files, e.g. it splits things like "get_file" at "_", "c:\windows" at "\", etc.
Is there a way to tell it to split only on blank spaces, only on newlines, etc.? I tried making my own Tokenizer, but sadly the given one only calls edu.udo.cs.wvtool.generic.tokenizer.StringTokenizer, which comes from a library...
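To make it concrete, this is roughly the behaviour I am after, as a plain-Java sketch. It uses only the JDK and does not touch the WVTool interfaces at all; the names WhitespaceSplit and splitOnWhitespace are made up for illustration, and I don't know yet how something like this would be hooked into the plugin.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplit {

    // Split only on runs of whitespace (spaces, tabs, newlines),
    // leaving characters such as '_' and '\' inside the tokens.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // An empty or all-whitespace input yields a single empty string; skip it.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [get_file, c:\windows, foo]
        System.out.println(splitOnWhitespace("get_file c:\\windows\nfoo"));
    }
}
```

Splitting only on newlines would be the same thing with "\\R" (Java 8 or later) in place of "\\s+".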