Text Tokenization Using Regular Expression For Text Mining
onurer007
New Altair Community Member
Hello,
I have a problem and would appreciate your help.
I want to tokenize an unstructured document using regular expressions. I have a text file in which each row contains a sentence, such as:
1. String1 String2 String3 String4 String5
2. String6 - String7 - -
...
n. String8 - String9 String10 - (assume the strings in positions 2 and 5 do not exist.)
What I want is for the tokenization to extract each word and output the results as a table in Excel format, such as:
S1 S2 S3 S4 S5
1. String1 String2 String3 String4 String5
2. String6 - String7 - -
3.
..
n. String8 - String9 String10 -
Which operators and which regular expression can I use in RapidMiner?
Thank you in advance for your help.
Answers
If your original document contains the dashes, you can simply read it with Read CSV and specify all blanks (space, tab, etc.) as the column separator.
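For anyone who wants to check the idea outside RapidMiner, here is a minimal Python sketch of the same approach: split every line on runs of whitespace and write the tokens out as columns. The sample text, the `Row` column name, and the helper names `tokenize_lines` and `to_csv` are made up for illustration, and it assumes every missing string is marked with a dash so that all rows have the same number of tokens.

```python
import csv
import io
import re

# Hypothetical sample that mirrors the layout described in the question.
SAMPLE = """\
1. String1 String2 String3 String4 String5
2. String6 - String7 - -
3. String8 - String9 String10 -
"""

# Assumed header: a row-number column followed by the five string columns.
HEADER = ["Row", "S1", "S2", "S3", "S4", "S5"]


def tokenize_lines(text):
    """Split each non-empty line into tokens on runs of whitespace."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # \s+ matches spaces, tabs and any other blank characters.
            rows.append(re.split(r"\s+", line))
    return rows


def to_csv(rows):
    """Render the header and the token rows as CSV text (opens in Excel)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


rows = tokenize_lines(SAMPLE)
csv_text = to_csv(rows)
print(csv_text)
```

Because the dashes act as placeholders, every row splits into the same six tokens and the columns stay aligned. If the dashes were missing from the file, a plain whitespace split could not tell which position is empty.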
Best regards,
Marius