Text Mining: How to remove particular phrases in pre-processing

User: "mob"
New Altair Community Member
Updated by Jocelyn
What's the best way to remove repeated sentences from my documents during pre-processing?

I have an example set that includes a "text" column and some other attributes. The text column was read in from files in a folder. The text contains a number of repeated phrases that I think I should remove before mining, since they would probably skew the word frequencies.

Given that "Filter Stopwords (Dictionary)" can only remove one stopword per line, how do I handle a case like removing "Assessment and Grading" while keeping the words "assessment" and "grading" where they occur elsewhere in the document? And how do I extend it to other sentences I need removed?

    Sounds like Remove Document Parts?

    ~Martin
    User: "mob"
    New Altair Community Member
    OP
    I thought that was for pulling out text I wanted to process further. Can I use it to dump repeated strings from the main text?
    There are Remove Document Parts and Keep Document Parts: one throws out the matching parts of a document, the other keeps them. Both can be configured with a regex.

    If you have an example set with keywords, you can use Aggregate with concat on it to generate a regex. This is a somewhat manual approach, but I think it is doable.

    ~Martin
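For readers who want to see the idea outside the tool, the "concatenate the phrases into one regex" approach can be sketched in Python. This is only an illustration of the technique, not how the operators above are implemented; the function names, the second phrase "Course Outline" and the sample text are made up for the example. It assumes each phrase starts and ends with a letter or digit, since the pattern relies on word boundaries.

```python
import re

def build_phrase_pattern(phrases):
    """Join phrases into one alternation regex (hypothetical helper)."""
    # Sort longest first so a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    # Escape each phrase so characters like "." or "(" are matched literally.
    escaped = [re.escape(p) for p in ordered]
    # \b stops a phrase from matching in the middle of a longer word.
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags=re.IGNORECASE)

def remove_phrases(text, pattern):
    """Replace every match with a space, then collapse leftover whitespace."""
    stripped = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", stripped).strip()

# Illustrative phrase list; in practice this would come from your own data.
phrases = ["Assessment and Grading", "Course Outline"]
pattern = build_phrase_pattern(phrases)

doc = ("Assessment and Grading The final assessment counts for 40% "
       "and grading is anonymous.")
cleaned = remove_phrases(doc, pattern)
print(cleaned)
```

Only the full phrase is removed; the single words "assessment" and "grading" later in the text are left alone, which is the behaviour asked for in the original question.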
    User: "mob"
    New Altair Community Member
    OP
    Actually, after testing, my assumption about Filter Stopwords appears to be incorrect. You can add an entire phrase as a "stop word", one per line, and it will be removed, e.g. linguistic sentences.