"Remove all lines with a text occurrence smaller than 10 from a certain column"

Hi,

I'm trying to refer to a certain column of the sample set and remove all lines where its value is smaller than 10. What's the way to do that?

e.g.

Process Documents from Files >> Filter Stopwords >> Tokenize >> Transform Cases >> Stem >> ??? now remove all lines where the column "text occurrence" is lower than 10 ???

Telcontar120

Your question is a bit confusing. Do you want to get rid of tokens that occur fewer than 10 times, or sentences (lines) that have fewer than 10 tokens? In either case, RapidMiner can do it. In the first case, just use the pruning options in Process Documents and set an absolute threshold of 10. In the second case, split each sentence into a separate document (you can use "Cut Documents" for this), then use "Extract Token Number", and then filter out any document (sentence) that has fewer than 10 tokens.
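To make the second reading concrete, here is a rough Python sketch of the same idea outside RapidMiner: split the text into sentences, count the tokens in each, and keep only sentences with at least 10 tokens. The function names, the naive sentence splitter, and the sample text are all made up for illustration; they are not RapidMiner operators and will not match its tokenization exactly.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def token_count(sentence):
    # A token here is any run of letters, digits, or underscores.
    return len(re.findall(r"\w+", sentence))


def keep_long_sentences(text, min_tokens=10):
    # Keep only sentences that have at least min_tokens tokens.
    return [s for s in split_sentences(text) if token_count(s) >= min_tokens]


# Hypothetical sample text with one long sentence between two short ones.
text = (
    "Short one. "
    "This sentence is long enough because it has well over ten separate tokens in it. "
    "Tiny."
)
kept = keep_long_sentences(text)
```

With this sample, only the middle sentence survives, since the other two have fewer than 10 tokens.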

Moritz

Thanks for the fast answer.

Let me try to rephrase a bit: the task is to remove all words from our document with a total occurrence smaller than 10. I already tried the pruning operator, but since there is no option to refer to the column "total occurrence", I don't have the opportunity to prune by it, i.e. to remove all words with an occurrence smaller than 10.

Telcontar120

Ah, got it. "Wordlist to Data" will let you take the word list and turn it into an ExampleSet, and then you will be able to filter on the "Total Occurrences" column.

Moritz

Okay. I did the first part, but I still can't filter on columns. Where do I apply the filter, and which filter do I apply?

Telcontar120

Use "Filter Examples" and set your condition to keep the rows where the Total Occurrences column is greater than or equal to 10. That removes every word that occurs fewer than 10 times.
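For anyone who wants to check the logic outside RapidMiner, the word list plus filter step can be sketched in plain Python: count how often each word occurs across all documents, then keep only the words whose total is at least 10. This is only an illustration of the idea, not the RapidMiner implementation; the function names and the toy documents are invented for the example.

```python
from collections import Counter


def total_occurrences(documents):
    # Sum word counts over all documents (each document is a list of tokens).
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def filter_wordlist(counts, min_total=10):
    # Keep only words whose total occurrence count is at least min_total.
    return {word: n for word, n in counts.items() if n >= min_total}


# Hypothetical, already tokenized documents.
documents = [
    ["data"] * 6 + ["mine"] * 3,
    ["data"] * 5 + ["mine"] * 4 + ["rare"],
    ["mine"] * 3 + ["text"] * 9,
]
counts = total_occurrences(documents)
kept = filter_wordlist(counts)
```

Here "data" (11) and "mine" (10) are kept, while "text" (9) and "rare" (1) are dropped. Note that the count is the total across documents, not the number of documents a word appears in.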