Tokenization vs N-grams

HeikoeWin786
HeikoeWin786 New Altair Community Member
edited November 2024 in Community Q&A
Hello guys,

I am doing sentiment analysis in RapidMiner. While creating the word vector, I found that there are two approaches: tokenization (by non-letters) and generating n-grams. I am not sure what the main difference between these two operators is, or what their best use cases are. Can someone explain how the two work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?

Thanks and regards,
Heikoe

Best Answer

  • kayman
    kayman New Altair Community Member
    Answer ✓
    n-grams are sequences of successive tokens (words, in this case), so the two are related rather than alternatives: you tokenize first, then build n-grams from the tokens. Adding n-grams rarely hurts an NLP workflow, so use them if your workflow can handle the larger number of attributes. You then have both the single tokens (words) and the n-grams available for training.

    Bi-grams will do fine for sentiment; anything longer typically doesn't add much value.
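
To make the relationship concrete, here is a minimal Python sketch of the idea, not RapidMiner's own implementation. The function names `tokenize` and `ngrams`, the lowercasing step, and the underscore used to join terms are all assumptions made for this illustration.

```python
import re


def tokenize(text):
    """Split on runs of non-letter characters (ASCII letters only, for simplicity)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def ngrams(tokens, max_n=2):
    """Return every n-gram of length 1..max_n, so single tokens are kept too."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Join successive tokens with "_" (an illustrative choice).
            out.append("_".join(tokens[i:i + n]))
    return out


tokens = tokenize("The movie was not good.")
features = ngrams(tokens, max_n=2)
print(tokens)    # single tokens only
print(features)  # single tokens plus bi-grams
```

In this sketch the bi-gram `not_good` appears as its own feature alongside `not` and `good`, which is the kind of word-order information single tokens cannot carry and one reason bi-grams tend to help with sentiment.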

Answers

  • HeikoeWin786
    HeikoeWin786 New Altair Community Member
    @kayman

    Thanks for the clarification.
    So that means we use bi-grams as part of data pre-processing?
    I.e., inside the Process Documents from Data operator, we add the bi-gram step to the pre-processing together with Tokenize, Stem (Porter), etc.?

    Thanks and regards,
    Heikoe
