How to split strings contained in a text column of csv file into words
Ayushi_Aggarwal
New Altair Community Member
I am currently reading a CSV file that has the columns review (text), n1, n2, n3, and overall (text).
I am using Select Attributes to keep only the review column, which gives me output in RapidMiner of the form:
Row Review
1 Poor service
2 There were torn seats
What I want to do is split the contents of the Review column into individual words, like: Poor, service, There, etc.
I am using Process Documents to Data > Tokenize, but somehow I am not getting the required output.
Please help.
Best Answers
Hi, if you don't necessarily have to use the Text extension, you could also simply use the "Split" operator (not to be confused with "Split Data") with a regular expression. I would say something simple like \s+|\W+ should do the trick: it splits on whitespace or on non-word characters (anything other than letters, digits, and underscores).
Best,
David
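For anyone who wants to check what that regular expression does before wiring it into the Split operator, here is a minimal sketch in plain Python (not RapidMiner). The function name `split_words` is hypothetical and only for illustration, and Python's `re` module may differ in details from the regex engine RapidMiner uses. Empty strings are filtered out because the pattern can produce them, for example when a space is followed directly by punctuation.

```python
import re

# Pattern suggested above: split on whitespace or on runs of non-word characters.
PATTERN = re.compile(r"\s+|\W+")


def split_words(text):
    """Split a review string into words, dropping empty pieces.

    Hypothetical helper for illustration only; it is not part of RapidMiner.
    """
    return [piece for piece in PATTERN.split(text) if piece]


if __name__ == "__main__":
    reviews = ["Poor service", "There were torn seats"]
    for review in reviews:
        print(split_words(review))
```

With the two sample rows from the question, this yields the words Poor, service and There, were, torn, seats.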
Can you be more specific about why Tokenize is not giving you what you expect? What are you getting? If you share your process and a data sample, it will be easier to troubleshoot. In general, Tokenize should do exactly what you are asking for: take a text column and split it up into individual words.