Filter Tokens (by POS Tags) without generating n-grams

HeikoeWin786 (New Altair Community Member)
Dear all,

I am running Process Documents from Data with the following operators: Tokenize, Transform Cases, Filter Tokens (by Length), Filter Stopwords (English), Stem (Porter), and Filter Tokens (by POS Tags).
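
For reference, here is a rough Python/NLTK sketch of the same chain of steps. This is only an illustration of what the operators do conceptually, not RapidMiner's actual implementation; the length bounds and the sample sentence are made up, and the POS filtering step is left out because it is the step in question:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)      # tokenizer model
nltk.download("stopwords", quiet=True)  # English stopword list

def preprocess(text, min_len=4, max_len=25):
    tokens = nltk.word_tokenize(text)               # Tokenize
    tokens = [t.lower() for t in tokens]            # Transform Cases
    tokens = [t for t in tokens
              if min_len <= len(t) <= max_len]      # Filter Tokens (by Length)
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]  # Filter Stopwords (English)
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]        # Stem (Porter)

print(preprocess("The runners were running quickly through the forest"))
# ['runner', 'run', 'quickli', 'forest']
```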

The process takes extremely long to run, almost 6 hours.

I am not sure if I am doing something incorrectly.
Is it OK to use Filter Tokens (by POS Tags) without generating n-grams, or must the n-grams be generated first?

Thanks

Best Answer

  • jacobcybulski (New Altair Community Member), Answer ✓
    I think this is happening because the Porter stemmer (like Snowball) is algorithmic and does not create part-of-speech tags. For the POS filter to work, you may need a dictionary-based stemmer such as WordNet. Try skipping the POS filter and see if that makes any difference.
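
    A minimal sketch of the problem, using NLTK only as a stand-in (RapidMiner's operator may behave differently): POS taggers expect real surface forms, and Porter stems often are not words, so tagging after stemming becomes guesswork.

    ```python
    import nltk
    from nltk.stem import PorterStemmer

    # Newer NLTK releases may name these resources 'punkt_tab' and
    # 'averaged_perceptron_tagger_eng' instead.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The company is happily expanding its operations")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('company', 'NN'), ('is', 'VBZ'), ('happily', 'RB'), ...]

    stems = [PorterStemmer().stem(t) for t in tokens]
    print(stems)  # roughly ['the', 'compani', 'is', 'happili', 'expand', 'it', 'oper']
    print(nltk.pos_tag(stems))
    # Stems like 'compani', 'happili', and 'oper' are not valid English words,
    # so the tags assigned here are unreliable and POS-based filtering degrades.
    ```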

Answers

  • jacobcybulski (New Altair Community Member)
    Also, as a test, downsample your documents to just a few and see whether the process gets through them at all; for example:
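
    A quick self-contained timing check in Python, in the spirit of that suggestion (the documents are made-up stand-ins, and tokenization stands in for the full pipeline):

    ```python
    import time
    import nltk

    nltk.download("punkt", quiet=True)

    documents = [  # made-up stand-ins for a downsampled corpus
        "The runners were running quickly through the forest",
        "Stock prices rose sharply after the announcement",
        "She happily accepted the new position",
    ]

    start = time.perf_counter()
    for doc in documents:
        nltk.word_tokenize(doc)  # stand-in for the full preprocessing chain
    print(f"Processed {len(documents)} docs in {time.perf_counter() - start:.3f} s")
    ```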
  • jacobcybulski (New Altair Community Member)
    n-grams merge pairs or triples of tokens that commonly go together, such as not-bad, into single tokens; they have no impact on POS-tag filtering. However, using n-grams will slow processing considerably.
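
    For illustration, this is what term bigrams look like (an NLTK sketch; RapidMiner joins terms similarly, though its joining character may differ):

    ```python
    import nltk

    nltk.download("punkt", quiet=True)

    tokens = nltk.word_tokenize("the movie was not bad at all")
    bigrams = ["_".join(pair) for pair in nltk.ngrams(tokens, 2)]
    print(bigrams)
    # ['the_movie', 'movie_was', 'was_not', 'not_bad', 'bad_at', 'at_all']
    ```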
  • HeikoeWin786 (New Altair Community Member)
    @jacobcybulski

    Thanks a lot. Based on your input, I did some research and things are much clearer now.