Generate n-Grams (Terms) operator

Fred12
Fred12 New Altair Community Member
edited November 2024 in Community Q&A

Hi,

I would like to know how the n-grams are generated. I noticed that some words are grouped together as n-grams (terms) while others are not (they stay single words). How does it decide which terms to group together and which not? Many of the most frequently occurring terms have no n-gram groupings...

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Here is how n-grams work if you set the length to 2. Given the sentence "RapidMiner Studio is the best.", it generates the following combinations:

     

    RapidMiner_Studio

    Studio_is

    is_the

    the_best
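
    The list above can be reproduced with a minimal Python sketch. This is plain Python, not RapidMiner's implementation; the function name `ngrams` and the whitespace tokenization are assumptions made for illustration, and the operator itself may also keep the single-word terms next to the combinations.

```python
def ngrams(tokens, n=2, sep="_"):
    """Join every run of n consecutive tokens with sep (hypothetical helper)."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokenize by whitespace; the trailing period is left out for simplicity.
tokens = "RapidMiner Studio is the best".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['RapidMiner_Studio', 'Studio_is', 'is_the', 'the_best']
```

    Every pair of neighbouring tokens becomes a term, so there is no decision about which words "belong" together at this stage; that only comes from weighting and filtering.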

     

    Assuming your corpus of documents consists of RapidMiner Studio reviews and you have TF-IDF set for word vector creation, it will likely give "is_the" a very low value and "RapidMiner_Studio" and "the_best" higher values. Of course, if you also have stemming, filtering, and pruning set, they might drop "is_the" out completely, and that's probably what's happening in your process.
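
    To make the weighting and filtering effects concrete, here is a rough Python sketch. The three-document corpus, the stop word list, and the textbook TF-IDF formula (term frequency times log(N / document frequency)) are all assumptions for illustration; RapidMiner's exact formula and normalization may differ.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Join each pair of neighbouring tokens with an underscore."""
    return ["_".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Hypothetical toy corpus of "reviews".
docs = [
    "RapidMiner Studio is the best",
    "RapidMiner Studio is the easiest",
    "this tool is the slowest",
]
doc_terms = [bigrams(d.split()) for d in docs]

# Document frequency: in how many documents does each bigram occur?
df = Counter(term for terms in doc_terms for term in set(terms))

def tf_idf(term, terms):
    tf = terms.count(term) / len(terms)
    idf = math.log(len(docs) / df[term])
    return tf * idf

# "is_the" occurs in every document, so its idf is log(1) = 0.
scores = {t: tf_idf(t, doc_terms[0]) for t in doc_terms[0]}

# Filtering stop words BEFORE building n-grams removes "is_the" entirely.
STOPWORDS = {"is", "the", "this"}  # hypothetical stop word list
filtered = [t for t in docs[0].split() if t.lower() not in STOPWORDS]
filtered_bigrams = bigrams(filtered)
print(filtered_bigrams)
# ['RapidMiner_Studio', 'Studio_best']
```

    In this sketch "is_the" gets a weight of exactly 0 because it appears in all documents, and it never forms at all once the stop words are removed before the n-gram step. Note that removing the stop words also makes "Studio" and "best" neighbours, which is why the order of the steps matters.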

  • Fred12
    Fred12 New Altair Community Member

    Well, inside the Process Documents operator I had the Tokenize, Stemming, Stopwords, and n-Gram operators, so this might have been the cause...
