"Naive Bayes for Text Classification"

cvidal New Altair Community Member
edited November 5 in Community Q&A
Hello,

I'm trying to apply naive Bayes to classify some texts, and I have two questions about how RapidMiner (v5.0.13) implements this classifier:

1.- As far as I know, one of the most frequently used classifiers for text classification is multinomial naive Bayes. The model obtained when using the naive Bayes operator is composed of a set of means and standard deviations for the words of my corpus... So, which kind of naive Bayes classifier is implemented in RapidMiner (multinomial, Gaussian, Bernoulli)?

2.- I have seen several examples of text classification applying naive Bayes in RapidMiner. Some of them use the TF-IDF matrix as input both when creating the model and when applying it. I understand that TF-IDF values are used to build the model. However, I suppose that TF-IDF values are not used when applying the model (it would not make sense)... In fact, the "process documents" operator receives a word list as input, and this modifies the "apply model" output. So,
    a) Is it relevant how texts are vectorized (TF, TF-IDF, term occurrences) when applying the naive Bayes model?
    b) Why does the "process documents" operator receive a word list, and how is it used when applying the model?

Thank you in advance.

Answers

  • MartinLiebig
    Altair Employee
    Hi!

    For 1: Gaussian
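
    As a rough illustration of why a Gaussian naive Bayes model shows up as means and standard deviations per word, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: the function names, the `min_std` floor, and the toy data are assumptions made up for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels, min_std=1e-3):
    """Estimate a class prior plus a (mean, std) pair per class and attribute."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        stats = []
        for column in zip(*members):  # one column per word attribute
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n
            # Floor the std so a constant attribute does not divide by zero.
            stats.append((mean, max(math.sqrt(var), min_std)))
        model[label] = (n / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest Gaussian log-posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, std) in zip(row, stats):
            score -= math.log(std * math.sqrt(2 * math.pi))
            score -= (value - mean) ** 2 / (2 * std ** 2)
        return score
    return max(model, key=log_posterior)

# Toy data: columns are weights for the hypothetical words "good" and "bad".
rows = [[0.9, 0.0], [0.7, 0.1], [0.1, 0.8], [0.0, 0.9]]
labels = ["pos", "pos", "neg", "neg"]
model = fit_gaussian_nb(rows, labels)
```

    Each class ends up with one mean and one standard deviation per word attribute, which matches the kind of model described in the question. A multinomial model would store per-class word probabilities instead.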

    For 2: There are two things to consider:
    1. Which attributes to create? Tokenize creates an attribute for every word in your documents that is not pruned. In the apply phase you do not want to create attributes for words that were not in the training set, and vice versa. So this is similar to the preprocessing model in Nominal to Numerical.
    2. TF-IDF contains some normalization. This needs to be applied in the apply phase as well.
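
    The two points above can be sketched as follows: the word list and the IDF weights are learned once from the training documents, then reused unchanged on any new document. This is only a sketch under assumptions. It uses whitespace tokenization and the plain `log(N / df)` form of IDF, which may differ from what RapidMiner computes, and `fit_tfidf` and `transform` are hypothetical names.

```python
import math
from collections import Counter

def fit_tfidf(train_docs):
    """Learn the word list and IDF weights from the training documents only."""
    tokenized = [doc.lower().split() for doc in train_docs]
    word_list = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    idf = {}
    for word in word_list:
        df = sum(1 for toks in tokenized if word in toks)
        idf[word] = math.log(n_docs / df)  # assumed IDF variant
    return word_list, idf

def transform(doc, word_list, idf):
    """Vectorize one document with the stored word list and IDF weights."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    # Only words from the stored word list become attributes;
    # words unseen during training are simply ignored.
    return [counts[w] / total * idf[w] for w in word_list]

train_docs = ["good movie", "bad movie", "good plot"]
word_list, idf = fit_tfidf(train_docs)

# A single new document, containing one word that was not in training.
new_vec = transform("good unseen movie", word_list, idf)
```

    Here `new_vec` has exactly one entry per training word, so it lines up with the attributes the model was trained on, and its weights come from the training IDF rather than from the new document alone.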

    cheers,
    Martin
  • cvidal
    cvidal New Altair Community Member
    Thanks for your response.

    I still do not see how TF-IDF can be applied as input to the "apply model" operator. I will try to explain myself:

    If I understand TF-IDF correctly, it makes sense to calculate it when dealing with several (the more the better) documents. TF can be calculated for a single document, but IDF takes into account the rest of the documents of the corpus. So TF-IDF values will vary depending on the entire corpus.

    If this is correct, there are several scenarios where applying TF-IDF is not a good option, for example:
        a) I want to classify only one comment (all the attribute, i.e. word, values in the TF-IDF matrix will be 0).
        b) If I change the corpus, the TF-IDF values for one comment will change, so the classification could (and probably will) change.
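
    Both scenarios can be reproduced with a small sketch, assuming the plain `log(N / df)` form of IDF (RapidMiner's exact formula may differ) and IDF recomputed on whatever documents are being classified. The helper `idf` and the example comments are made up for illustration.

```python
import math

def idf(word, corpus):
    """Plain IDF, log(N / df); assumes the word occurs in at least one document."""
    df = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / df)

comment = "great phone"

# Scenario a): the corpus is the single comment, so df == N for every word.
alone = [idf(w, [comment]) for w in comment.split()]

# Scenario b): the same comment placed in two different corpora.
corpus_1 = [comment, "bad phone", "slow delivery"]
corpus_2 = [comment, "great service", "great price"]
in_corpus_1 = [idf(w, corpus_1) for w in comment.split()]
in_corpus_2 = [idf(w, corpus_2) for w in comment.split()]
```

    In this sketch `alone` is all zeros, and the weights in `in_corpus_1` and `in_corpus_2` differ for the same comment. Both effects come from recomputing IDF on the documents at hand; they would not occur if the IDF weights were fixed beforehand.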

    I have run some tests and have seen that using the word list as input for "process documents" makes the "apply model" operator change its output. I am not sure, but it seems that when the word list is used as input, the output (classification) is the same regardless of how the vectors were created (TF, TF-IDF, etc.).

    Can you help me clarify this?

    Thanks.