Text Mining: How to remove particular phrases in pre-processing
What's the best way to remove repeated sentences from my documents during pre-processing?
I have an example set that includes a "text" column and some other attributes. The text column was read in from files in a folder. The text contains a number of repeated phrases that I think I should remove before mining, because they would probably skew the word frequencies.
Given that "Filter Stopwords (Dictionary)" can only remove one stopword per line, how do I handle a case like removing the phrase "Assessment and Grading" while keeping the words "assessment" and "grading" when they appear elsewhere in the document? And how do I extend the approach so I can add other sentences that need to be removed?
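To make the goal concrete, here is a minimal sketch of the behavior I'm after, written in plain Python rather than with the operator mentioned above. It only illustrates the idea of removing whole phrases before tokenizing; the phrase list and the function names are hypothetical, and it assumes each phrase starts and ends with a letter or digit.

```python
import re

# Hypothetical phrase list; it could just as well be read from a text file,
# one phrase per line, so new phrases can be added without changing code.
PHRASES = ["Assessment and Grading", "Learning Outcomes"]

def build_phrase_pattern(phrases):
    """Compile one case-insensitive regex that matches any whole phrase."""
    # Longest phrases first, so a long phrase wins over a shorter one it contains.
    ordered = sorted((p for p in phrases if p.split()), key=len, reverse=True)
    if not ordered:
        return re.compile(r"(?!)")  # matches nothing when the list is empty
    # Allow any run of whitespace between words, so a phrase broken
    # across a line still matches.
    parts = [r"\s+".join(re.escape(w) for w in p.split()) for p in ordered]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

def remove_phrases(text, pattern):
    """Delete every phrase match, then collapse the leftover whitespace."""
    stripped = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", stripped).strip()

pattern = build_phrase_pattern(PHRASES)
sample = ("Assessment and Grading\n"
          "The assessment is weekly and grading is strict.\n"
          "Assessment and Grading")
cleaned = remove_phrases(sample, pattern)
```

Because the pattern only matches the full phrase, the standalone words "assessment" and "grading" in the middle sentence are left alone, and extending the removal is a matter of adding entries to the list.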