"Remove all lines with an occurrence count smaller than 10 in a certain column"
Thanks for the fast answer.
Let me try to rephrase a bit: the task is to remove all words from our document that have a total occurrence count smaller than 10. I already tried the pruning operator, but since there is no option to refer to the column "total occurance", I don't have the opportunity to prune on it, i.e. to remove all words that occur fewer than 10 times.
@Telcontar120 wrote: Your question is a bit confusing. Do you want to get rid of tokens that occur fewer than 10 times, or sentences (lines) that have fewer than 10 tokens? In either case, RapidMiner can do it. In the first case, just use the pruning options in Process Documents and set an absolute threshold of 10. In the second case, split each sentence into a separate document (you can use "Cut Documents" for this), then use "Extract Token Number", and then filter out any document (sentence) that has fewer than 10 tokens.
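For anyone who wants to sanity-check the result outside RapidMiner, here is a rough sketch of the two readings discussed in this thread, in plain Python. It is not how the RapidMiner operators work internally. The function names (`tokenize`, `drop_rare_tokens`, `drop_short_sentences`), the simple regex tokenizer, and the sentence-splitting rule are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Hypothetical stand-in tokenizer: lowercase, then keep runs of letters a-z.
    return re.findall(r"[a-z]+", text.lower())


def drop_rare_tokens(documents, min_total=10):
    """Reading 1: remove tokens whose total count over ALL documents is below min_total."""
    tokenized = [tokenize(doc) for doc in documents]
    # Count every token across the whole collection, not per document.
    totals = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if totals[tok] >= min_total] for doc in tokenized]


def drop_short_sentences(text, min_tokens=10):
    """Reading 2: keep only sentences that contain at least min_tokens tokens."""
    # Assumed sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(tokenize(s)) >= min_tokens]
```

The first function matches the rephrased question (filter on the total count of each word across the whole collection); the second matches the "short sentences" interpretation. Both thresholds default to 10 only because that is the number used in this thread.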