Parts of Speech (POS) Filtering

Hyram
Hyram New Altair Community Member
edited November 5 in Community Q&A
Hi. 

I have tokenised some text and am now trying to filter it by part of speech (POS) using the Filter by POS operator. I have used the following expression: N.*|VB.*|RB.*|JJ.*|MD.*|PP.* in an attempt to keep nouns, adjectives, verbs and adverbs. The problem is that some nouns and verbs were filtered out; for example, the word "need" is no longer present in my text. 
What am I doing wrong and do I have the right expression for the POS tokens I want to keep (nouns, adjectives, verbs and adverbs)?

Thanks,
Hyram 

Best Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    As you seem to know, you need to use the PENN POS tags, which are available here:
    https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    Your expression doesn't look like it has anything obviously wrong with it to me.
    @kayman can you take a look and see if you can find anything wrong?
    Also you might try doing one selection at a time to see if there is a problem with the compound expression?
    Or try filtering out specific tags rather than keeping only certain tags?
  • kayman
    kayman New Altair Community Member
    Answer ✓
    Hi @Hyram , @Telcontar120

    Seems ok at first glance indeed.
    As recommended by Brian, try the same without filters (or filter one by one) so you can see which tag RM gives to 'need'. POS tags are sensitive to word position and context, so the same word can get different tags depending on the string.

    'Need' can be either a verb or a noun, for instance, but since you are capturing both it shouldn't be a problem to start with.
    You didn't accidentally select the invert option?
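
To make the "one tag at a time" check concrete, here is a minimal Python sketch that tests the expression from the question against a few Penn Treebank tags. It is not RapidMiner code: the tagged tokens are made up for illustration, the `keep_token` helper is hypothetical, and it assumes the operator matches the whole tag against the expression, which may not be how Filter by POS behaves internally.

```python
import re

# The expression from the question, applied to Penn Treebank tag names.
KEEP = re.compile(r"N.*|VB.*|RB.*|JJ.*|MD.*|PP.*")

# Hypothetical tagged tokens; a real tagger may assign different tags.
tagged = [
    ("I", "PRP"),
    ("need", "VBP"),      # "need" used as a verb
    ("a", "DT"),
    ("need", "NN"),       # "need" used as a noun
    ("quickly", "RB"),
    ("urgent", "JJ"),
]

def keep_token(tag, invert=False):
    """Return True if a token with this tag survives the filter.

    Assumes whole-tag matching; invert=True mimics an 'invert' option.
    """
    matched = KEEP.fullmatch(tag) is not None
    return matched != invert

kept = [word for word, tag in tagged if keep_token(tag)]
kept_inverted = [word for word, tag in tagged if keep_token(tag, invert=True)]

print(kept)           # tokens kept by the expression
print(kept_inverted)  # tokens kept when the filter is inverted
```

Under these assumptions both readings of "need" (VBP and NN) pass the expression, so the expression alone would not explain the missing word. Note also that in the tag list linked above pronouns are tagged PRP and PRP$, so the PP.* branch would not match them with whole-tag matching.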

Answers

  • Hyram
    Hyram New Altair Community Member
    Thank you very much @Telcontar120 and @kayman. I will try what you've suggested. I didn't mistakenly select invert. I've also subsequently found that some of the words reside in the dictionary I'm using for stop words and hence some of them have been filtered out accordingly.
  • Penelopemax
    Penelopemax New Altair Community Member
    Having tokenized text, I'm attempting to remove parts of speech (POS) using the Filter by POS Operator. The expression is used to retain nouns, adjectives, verbs, and adverbs. However, I encountered an issue where, for example, the word "need" was filtered out. Seeking a solution to preserve essential words like "need" in the filtered text.
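
The cause Hyram reported above, a stop word dictionary that also contains some of the wanted words, can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop word set, the tagged tokens and the `pos_filter` / `stop_filter` helpers are all hypothetical, and the real process uses RapidMiner operators rather than this code.

```python
import re

# The expression from the question, matched against the whole tag (an assumption).
KEEP = re.compile(r"N.*|VB.*|RB.*|JJ.*|MD.*|PP.*")

# Hypothetical stop word dictionary that happens to contain "need".
STOP_WORDS = {"i", "a", "need"}

# Hypothetical tagged tokens for illustration.
tagged = [("I", "PRP"), ("need", "VBP"), ("urgent", "JJ"), ("help", "NN")]

def pos_filter(tokens):
    """Keep only tokens whose tag matches the expression."""
    return [(w, t) for w, t in tokens if KEEP.fullmatch(t)]

def stop_filter(tokens):
    """Drop tokens whose lowercased form is in the stop word dictionary."""
    return [(w, t) for w, t in tokens if w.lower() not in STOP_WORDS]

after_pos = [w for w, _ in pos_filter(tagged)]
after_both = [w for w, _ in stop_filter(pos_filter(tagged))]

print(after_pos)   # "need" survives the POS filter
print(after_both)  # "need" is removed by the stop word step
```

In this sketch "need" passes the POS expression and is only lost at the stop word step, so checking the stop word dictionary is worth doing before changing the expression.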