Text Tokenization Using Regular Expression For Text Mining

Question

Hello,
I have a problem and I need your help, please.
I want to tokenize an unstructured document using regular expressions. I have a text file where each row contains a sentence such as:

1. String1 String2 String3 String4 String5
2. String6      -      String7    -           -
...
n. String8    -        String9 String10   -               (assume the second and fifth strings don't exist in this row.)

What I want is for the tokenization to extract each word and put the results in a table in Excel format, such as:

     S1        S2        S3        S4        S5
1.   String1   String2   String3   String4   String5
2.   String6   -         String7   -         -
3.
..
n.   String8   -         String9   String10  -

Which operators and which regular expression structure can I use in RapidMiner?
Thank you for your help in advance.

MariusHelf · Answer

If your original document contains the dashes, you can simply read it with Read CSV and specify all blanks (space, tab, etc.) as the column separator.

Best regards,
Marius
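
For readers who want to check the idea outside RapidMiner, here is a minimal Python sketch of the same approach: treat any run of blanks as one column separator, keep the dashes as placeholders for missing values, and write the result as CSV, which Excel can open. The function name tokenize_rows, the "Row" header, and the sample lines are assumptions made for illustration; this is not RapidMiner's implementation.

```python
import csv
import io
import re


def tokenize_rows(lines):
    """Split each non-empty line on runs of whitespace (spaces, tabs, etc.)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # \s+ matches one or more blanks, so repeated spaces act as one separator
        rows.append(re.split(r"\s+", line))
    return rows


# Hypothetical sample input, modelled on the rows in the question
sample = [
    "1. String1 String2 String3 String4 String5",
    "2. String6      -      String7    -           -",
]

rows = tokenize_rows(sample)

# Write the table as CSV text; in practice, write to a file opened with newline=""
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Row", "S1", "S2", "S3", "S4", "S5"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

This only works because the dashes hold the place of missing strings. If they were absent, splitting on blanks could not tell which column a word belongs to.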