Filter Stopwords with Regular Expression
Anna_May1
New Altair Community Member
Hi guys,
I'm currently doing a sentiment analysis in RapidMiner with k-NN. I want to count the number of words left in each document after removing stopwords. The "Filter Stopwords" operator inside the "Process Documents from Data" operator only works if I tokenize the data and use the "Nominal to Text" operator first. The issue is that the output then looks like the image below. Since I want to count the words that remain after removing the stopwords, I wonder whether there is a regular expression that could be used inside a "Replace" operator or similar to remove only the stopwords without tokenizing the text.
Cheers!
Answers
@Anna_May1 I can't see the image, as it is not attached. However, it is much easier to deal with stopwords, or to count words, after you tokenise the text. For example, you can run two streams of text processing, one with and one without stopwords, count the tokens in each, and take the difference. In fact, when your text is represented by frequency, the counting is very simple: just add up the frequencies in the columns.
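To make the two-stream idea concrete outside RapidMiner, here is a minimal Python sketch. It is only an illustration, not how the RapidMiner operators work internally: the tiny stopword list and the function names (`tokenize`, `count_words`, `strip_stopwords`) are assumptions made up for this example. It counts tokens with and without stopwords, and also shows the kind of whole-word regular expression the question asks about.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "i"}


def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text):
    """Return (total tokens, tokens left after stopword removal)."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOPWORDS]
    return len(tokens), len(kept)


def strip_stopwords(text):
    """Regex alternative: delete stopwords as whole words without tokenizing."""
    # \b anchors keep "it" from matching inside "wit" or "a" inside "and".
    pattern = r"\b(?:" + "|".join(map(re.escape, sorted(STOPWORDS))) + r")\b"
    stripped = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words.
    return re.sub(r"\s+", " ", stripped).strip()


sentence = "The movie is a triumph of style and wit"
total, remaining = count_words(sentence)
print(total, remaining, total - remaining)  # 9 4 5
print(strip_stopwords(sentence))            # movie triumph style wit
```

A pattern of the same shape (`\b(?:word1|word2|...)\b`) is what you would try in a "Replace" operator, assuming its regex engine supports word boundaries and non-capturing groups. Counting the remaining words still requires splitting on whitespace afterwards, which is why tokenizing first tends to be simpler.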