Naive Bayes Probabilities

Kamesh · September 2011

Hi,

I am trying to use a Naive Bayes model to classify text data by sentiment. I created a training model and am applying it to test data. I want to know the probabilities the Naive Bayes model assigns to each word in the word vector. I have the distribution from the training model for all the words in the training data, but I would like to see the exact probabilities assigned to each word in the test data for the two categories.
I need this because I would like to check the impact of Laplace Correction on new words in my test data that are not present in the training word list. I am using "Binary Term Occurrences" during vector creation.
The funny thing is that when I have one test record, it is classified as negative. But when I add another record to the test data, the first record is now classified as positive! I don't understand why introducing another record should change the classification of the first record.

Is there any way to see the exact probabilities that the model calculates for each record in my test data? I am using RapidMiner 5.1.
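To make the Laplace Correction question concrete, here is a minimal sketch of the per-class probability of a binary term-occurrence attribute. It assumes the common add-one form, (count + 1) / (classTotal + 2), for an attribute with two values (present or absent); the exact correction RapidMiner 5.1 applies may differ. The class and method names are hypothetical and not part of RapidMiner.

```java
// Hypothetical helper, not part of RapidMiner: add-one (Laplace) estimate
// for a binary "term occurs" attribute. The formula is an assumption.
class LaplaceSketch {

    // P(word present | class) with add-one smoothing over the two values
    // (present / absent) of a binary term-occurrence attribute.
    static double presentProbability(int docsInClassWithWord, int docsInClass) {
        return (docsInClassWithWord + 1.0) / (docsInClass + 2.0);
    }

    public static void main(String[] args) {
        // Word seen in 3 of 10 negative training documents.
        System.out.println(presentProbability(3, 10));
        // Word never seen in the 10 positive training documents:
        // without the correction the estimate would be exactly 0
        // and would wipe out the whole product for that class.
        System.out.println(presentProbability(0, 10));
    }
}
```

Under this assumption a word that never occurred with a class still gets a small non-zero probability, which is the effect I would like to inspect for the new words in my test data.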

Kamesh · September 2011

OK. Since I didn't get any response, I dug into the code a bit to find out what is going on. There seems to be a bug in SimpleDistributionModel.
performPrediction expects the test data (example set) to contain exactly the same word list as the training data. A for loop iterates over each attribute (word) of exampleSet and reads its distribution from the distributionProperties array. It assumes that the jth attribute of the example set and the jth entry of the array refer to the same attribute! That holds only if both were built from the same attribute list. Here is the relevant code:

			// j (declared before this loop) counts attributes by position only
			for (Attribute attribute : exampleSet.getAttributes()) {
				double value = example.getValue(attribute);
				if (nominal[j]) {
					if (!Double.isNaN(value)) {
						int intValue = (int) value;
						for (int i = 0; i < numberOfClasses; i++) {
							if (intValue < distributionProperties[j][i].length) {
								probabilities[i] += distributionProperties[j][i][intValue];
							}
						}
					} else {
						// missing value: use the last entry of the distribution
						for (int i = 0; i < numberOfClasses; i++) {
							probabilities[i] += distributionProperties[j][i][distributionProperties[j][i].length - 1];
						}
					}
				} else {
					if (!Double.isNaN(value)) {
						for (int i = 0; i < numberOfClasses; i++) {
							double base = (value - distributionProperties[j][i][INDEX_MEAN]) / distributionProperties[j][i][INDEX_STANDARD_DEVIATION];
							probabilities[i] -= distributionProperties[j][i][INDEX_LOG_FACTOR] + 0.5 * base * base;
						}
					}
				}
				// the jth test attribute is paired with the jth training distribution
				j++;
			}

So it works properly if I include the word list generated from the training data when processing my test data. Although this is not a mandatory step, the code works properly only in this case.
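The mismatch can be reproduced outside RapidMiner. The sketch below is a hypothetical illustration, not RapidMiner code: for a single class, each training word carries one made-up log-probability, and the test record brings its own, different word list. Positional lookup pairs the jth test word with the jth training entry, while lookup by name pairs each word with its own entry and skips words the model has never seen. Presumably this is also why adding a second record changed the label of the first one: the new record changes the test word list, so the positions shift.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not RapidMiner code. All numbers are made up.
class AttributeAlignmentSketch {

    // Positional lookup: the j-th test attribute is scored with the j-th
    // training distribution, whatever its name is.
    static double positional(List<String> testAttributes, double[] trainedLogProb) {
        double sum = 0.0;
        int j = 0;
        for (String ignored : testAttributes) {
            if (j < trainedLogProb.length) {
                sum += trainedLogProb[j];
            }
            j++;
        }
        return sum;
    }

    // Name-based lookup: each test attribute is matched to its own training
    // distribution; attributes unknown to the model are skipped.
    static double byName(List<String> testAttributes, Map<String, Double> trainedLogProb) {
        double sum = 0.0;
        for (String name : testAttributes) {
            Double logProb = trainedLogProb.get(name);
            if (logProb != null) {
                sum += logProb;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Training word list in training order: bad, good, movie.
        Map<String, Double> model = new LinkedHashMap<>();
        model.put("bad", -0.1);
        model.put("good", -2.0);
        model.put("movie", -0.7);
        double[] positionalModel = {-0.1, -2.0, -0.7};

        // Test word list built independently of the training word list.
        List<String> testWords = Arrays.asList("good", "plot");
        // "good" is scored as "bad" and "plot" as "good".
        System.out.println(positional(testWords, positionalModel));
        // "good" is scored as "good"; "plot" is unknown and skipped.
        System.out.println(byName(testWords, model));
    }
}
```

Reusing the training word list for the test data makes both lookups agree, which matches what I see in practice.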
Another thing I observed: if an attribute is not present in the test data, its value in the word vector becomes zero for that attribute (when you use Binary Term Occurrences). The algorithm calculates the probability of that word being zero in each class and includes it in the total probability. As I understand it, this is not proper Naive Bayes, because Naive Bayes only calculates the probability of the attributes that are present in the data. This may actually introduce some bias.
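To show how much the zero-valued attributes can matter, here is a small sketch with made-up probabilities and equal class priors (both are assumptions, and none of this is RapidMiner code). One method scores every word in the vocabulary, so an absent word contributes log(1 - p), which is the behaviour described above and corresponds to what is usually called the Bernoulli event model. The other scores only the words that are present.

```java
// Hypothetical illustration with made-up probabilities, not RapidMiner code.
class AbsentTermSketch {

    // Log-score over every vocabulary word: present words contribute
    // log p, absent words contribute log(1 - p).
    static double scoreAllWords(double[] presentProb, boolean[] present) {
        double logScore = 0.0;
        for (int j = 0; j < presentProb.length; j++) {
            logScore += Math.log(present[j] ? presentProb[j] : 1.0 - presentProb[j]);
        }
        return logScore;
    }

    // Log-score over present words only; absent words are ignored.
    static double scorePresentOnly(double[] presentProb, boolean[] present) {
        double logScore = 0.0;
        for (int j = 0; j < presentProb.length; j++) {
            if (present[j]) {
                logScore += Math.log(presentProb[j]);
            }
        }
        return logScore;
    }

    public static void main(String[] args) {
        // P(word present | class) for a three-word vocabulary.
        double[] positive = {0.8, 0.6, 0.5};
        double[] negative = {0.5, 0.1, 0.1};
        // The test record contains only the first word.
        boolean[] record = {true, false, false};

        System.out.println("all words:    pos=" + scoreAllWords(positive, record)
                + " neg=" + scoreAllWords(negative, record));
        System.out.println("present only: pos=" + scorePresentOnly(positive, record)
                + " neg=" + scorePresentOnly(negative, record));
    }
}
```

With these numbers the record scores higher for the negative class when absent words are included and higher for the positive class when they are ignored, so the choice between the two treatments can flip a classification.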

Not sure if I made myself clear!! I hope one of the RapidMiner developers answers my doubts.
