A question about Naive Bayes based text classification
gfyang
New Altair Community Member
Hi,
I am testing Naive Bayes (NB) for text classification. To my understanding, the result should not be affected by the TF-IDF vectors of the texts, because NB considers the frequency of each term (t) in each category (c), i.e., p(t | c), and this information is stored in the WordList, not in the term vectors (i.e., the ExampleSet). Right? A sketch of the model I have in mind follows below.
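Here is a rough sketch of that model (this is not RapidMiner code; all names are made up for illustration): only the prior p(c) and the per-category term probabilities p(t | c) enter the score, not the TF-IDF values.

// Sketch of the model I have in mind: a document d goes to the category c
// that maximizes log p(c) + sum over terms t of count(t, d) * log p(t | c).
// All names are made up for illustration; this is not RapidMiner code.
import java.util.Map;

public class NaiveBayesSketch {
    static double score(Map<String, Integer> docTermCounts,    // term counts of document d
                        double categoryPrior,                  // p(c)
                        Map<String, Double> termProbInCategory // p(t | c), from per-category term frequencies
    ) {
        double s = Math.log(categoryPrior);
        for (Map.Entry<String, Integer> e : docTermCounts.entrySet()) {
            Double p = termProbInCategory.get(e.getKey());
            if (p != null) {
                s += e.getValue() * Math.log(p); // only p(t | c) enters, not TF-IDF
            }
        }
        return s; // choose the category with the highest score
    }
}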
However, after I changed the TF-IDF values in the ExampleSet, for example by multiplying them with a weight x, 0 < x < 1, the accuracy changed, and it changed differently for different weights x. Why?
Sincerely yours,
gfyang
Answers
Hi,
Naive Bayes is a general learning algorithm that works on tables. You can use it for text classification, but it is applicable to all other kinds of problems, too.
Although the original TF-IDF values of the documents were calculated using the word list, Naive Bayes doesn't know about that. It just takes the example set into consideration.
On the other hand, if you apply the same weight transformation to all examples of the example set, the Naive Bayes result shouldn't differ, because it treats all attributes as independent of each other. But there might be some numerical problems within the limits of floating-point precision, causing slightly different results.
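To illustrate with a small sketch (assuming the usual Gaussian class-conditional model for numerical attributes; all numbers are made up): scaling a value together with its per-class mean and standard deviation by the same factor x leaves the density ratio between two classes unchanged, so the prediction shouldn't change either.

// Sketch: uniform scaling cancels out in the Gaussian density ratio
// that a Naive Bayes classifier compares between classes.
public class ScaleInvarianceSketch {
    // Gaussian density N(v; mean, sd)
    static double gaussian(double v, double mean, double sd) {
        double z = (v - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        double v = 0.7, x = 0.3;       // a TF-IDF value and a weight (made-up numbers)
        double meanA = 0.5, sdA = 0.2; // estimated for class A on this attribute
        double meanB = 0.9, sdB = 0.3; // estimated for class B on this attribute

        // likelihood ratio before scaling
        double before = gaussian(v, meanA, sdA) / gaussian(v, meanB, sdB);
        // after scaling all values by x, the estimated means and sds scale by x too,
        // and the resulting 1/x factor in each density cancels in the ratio
        double after = gaussian(x * v, x * meanA, x * sdA)
                     / gaussian(x * v, x * meanB, x * sdB);

        System.out.println(before + " vs. " + after); // equal up to rounding
    }
}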
Greetings,
Sebastian
Hi, Sebastian,
Thank you for the reply.
I ran several experiments. For example, I multiplied all the TF-IDF values by the same weight, then changed the weight and applied it to all the TF-IDF values again. The results show that such a weight adjustment really does change the accuracy, even though all the TF-IDF values are adjusted by exactly the same weight.
The test code is:
double precision = 0.0;
Iterator<Attribute> attributeIterator; // iterator over all attributes
Iterator<Example> exampleIterator;     // iterator over all examples

// save the original text vectors into an array
double[][] text_array = new double[num_exp][num_att - 2];
exampleIterator = exampleSet.iterator(); // move the iterator to the beginning
for (int i = 0; i < num_exp; i++)
{
    Example example = exampleIterator.next(); // read one example
    attributeIterator = attributes.allAttributes(); // build the iterator over the attributes
    for (int j = 0; j < num_att - 2; j++) // read all the attributes except the last two
    {
        Attribute att = attributeIterator.next();
        text_array[i][j] = example.getValue(att); // read the TF-IDF value into the array
    }
}

// adjust the TF-IDF values with weights
double fWeight = 0;
for (int i = 0; i <= 20; i++) // 21 weights: 0.00, 0.05, ..., 1.00
{
    exampleIterator = exampleSet.iterator(); // move the iterator back to the beginning
    for (int i2 = 0; i2 < num_exp; i2++)
    {
        Example example = exampleIterator.next();
        attributeIterator = attributes.allAttributes();
        for (int j = 0; j < num_att - 2; j++)
        {
            Attribute att = attributeIterator.next();
            double val = text_array[i2][j] * fWeight; // scale the original TF-IDF by the weight
            example.setValue(att, val); // write the adjusted TF-IDF back into the ExampleSet
        }
    }
    precision = my_validate_classification(); // classify with Naive Bayes on the adjusted TF-IDF values
    System.out.println("(" + fWeight + "): " + precision + " ");
    fWeight += 0.05; // increase the weight
    fWeight = roundTwoDecimals(fWeight); // keep two decimal places
}
The differences in the results below seem too large to be ignored, so they are probably not caused by floating-point precision alone:
(weight): precision
(0.0): 0.0
(0.05): 0.3875
(0.1): 0.3125
(0.15): 0.3125
(0.2): 0.3125
(0.25): 0.2875
(0.3): 0.275
(0.35): 0.2625
(0.4): 0.2625
(0.45): 0.2625
(0.5): 0.25
(0.55): 0.25
(0.6): 0.25
(0.65): 0.2375
(0.7): 0.2375
(0.75): 0.2375
(0.8): 0.2375
(0.85): 0.2375
(0.9): 0.2375
(0.95): 0.2375
(1.0): 0.2375
So I guess that when RapidMiner runs the NB classification, the algorithm really does read the ExampleSet and bases some important calculations on it, which directly affects the precision.
Sincerely yours,
gfyang
Hi,
which version of RapidMiner do you use?
By the way: there are many methods in the RapidMiner API that would make your life simpler...
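For example, even with just the calls from your snippet, the weight loop can be written more simply by factoring the scaling into a helper. A sketch (my_validate_classification and roundTwoDecimals are your own helpers; the other names are from your code above):

// A sketch of the same experiment with the scaling factored out; it uses
// only the ExampleSet / Example / Attribute calls from the code above.
void scaleAllValues(ExampleSet exampleSet, Attributes attributes,
                    double[][] original, double weight) {
    Iterator<Example> exampleIterator = exampleSet.iterator();
    for (int i = 0; i < original.length; i++) {
        Example example = exampleIterator.next();
        Iterator<Attribute> attributeIterator = attributes.allAttributes();
        for (int j = 0; j < original[i].length; j++) {
            Attribute att = attributeIterator.next();
            example.setValue(att, original[i][j] * weight);
        }
    }
}

// usage: scale, evaluate, print -- one pass per weight
// for (int step = 0; step <= 20; step++) {
//     double w = step * 0.05; // avoids floating-point drift from repeated adding
//     scaleAllValues(exampleSet, attributes, text_array, w);
//     System.out.println("(" + w + "): " + my_validate_classification());
// }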
Greetings,
Sebastian
Hi,
I am using RapidMiner 4.5.
I am developing a new idea for adjusting the text vectors, and I want to test it on several classic classification methods. I will try the other methods later. Thank you for the help.
Sincerely yours,
gfyang