Performance result: Training vs Test
HeikoeWin786
New Altair Community Member
Dear all,
I am new to RapidMiner and I wanted to perform NBC on an airline dataset. I have an airline dataset with labelled sentiment data (pos, neg, and neutral). I divided the dataset with a 75/25 split and performed the text processing (i.e. nominal to text, data to documents, and process documents with tokenization and stopword removal). However, when the word list came out of the process documents operator, I found that the neg, pos and neutral columns all have zero values. Then, after I applied the NBC, I received an accuracy of 87% on the training data but 0.00% on the test data.
Can you please help me understand what I am missing here?
Thanks a lot in advance!
Best Answer
It's not really possible to diagnose the problem just from looking at this screenshot.
Here are some things you could check.
Did you perform the text pre-processing and processing before or after the split? If before, then there should be no issue, but if after, then you probably need to replicate the wordlist from the training set to the test set, otherwise the model inputs will not be consistent (see the sketch below these points).
What tool did you use to get the sentiment? I recommend Extract Sentiment which is now part of the Text Processing extension.
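To make the wordlist point more concrete, here is a rough sketch outside RapidMiner, in Python with scikit-learn; the texts, labels and variable names are invented for illustration and are not taken from the airline data. The vectorizer plays the role of the training wordlist: it is fitted once on the training documents and only reused, never refitted, on the test documents.

```python
# Minimal sketch: share the training "wordlist" (vectorizer) with the test set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_texts = ["flight was great", "lost my luggage", "seat was ok"]   # placeholder data
train_labels = ["pos", "neg", "neutral"]
test_texts = ["great crew", "luggage was lost again"]
test_labels = ["pos", "neg"]

vectorizer = CountVectorizer(stop_words="english")   # wordlist is built on training data only
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)            # reuse the same wordlist; do not fit again

model = MultinomialNB().fit(X_train, train_labels)
print("train accuracy:", accuracy_score(train_labels, model.predict(X_train)))
print("test accuracy:", accuracy_score(test_labels, model.predict(X_test)))
```

If the test documents were vectorized with their own, separately built wordlist instead, the columns of the two matrices would no longer describe the same words, which is one way to end up with all-zero word columns and a 0% test accuracy.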
Answers
@Telcontar120
Thanks a lot.
I revisited the whole process: I split the data, and for the test data I used the word output from the text pre-processing of the training dataset. Then I received the result, but the result for the training data and the test data is the same. Is this normal?
E.g. Train data --> Text preprocessing (store the word output) --> NBC
Test data --> Text preprocessing (input the word output from above step) --> NBC
The accuracy is 65% for both processes. Is that ideal?
thanks and regards,
Heikoe
It is impossible to say without seeing the data. It is certainly possible.
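One way to get a feel for whether nearly identical training and test figures are plausible is to cross-validate the same pipeline end to end. Below is a minimal Python/scikit-learn sketch with made-up texts and labels (nothing here comes from the airline data or the actual RapidMiner process); because the wordlist is rebuilt inside each fold, every test fold is scored only on words learned from its own training fold.

```python
# Cross-validation sketch with placeholder data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

texts = ["flight was great", "lost my luggage", "seat was ok",
         "crew was friendly", "delayed again", "average service",
         "loved the food", "worst airline ever", "nothing special"]
labels = ["pos", "neg", "neutral"] * 3

# The vectorizer and the model are refitted inside every fold.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=3)
print("fold accuracies:", scores, "mean:", scores.mean())
```

If the cross-validated accuracy is in the same range as the single train/test split, similar numbers for training and test are nothing unusual.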
Hi Heiko, it seems to me as if the label somehow gets lost. Can you check whether the word-list output still provides a label attribute (it is marked in a green column) in the word training data set? You can also check the roles; some operators skip special attributes like the label, and then it gets lost.
Also, if you split 25/75 between test and training, it would be interesting to see this in the same process. If you always do it like this, within the same process, you prevent yourself from processing the training data differently from the test data.
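As a rough illustration of that label check, here is a small pandas sketch; the column names ("great", "luggage", "sentiment") are invented stand-ins for the word-vector columns and the label attribute, not the real output.

```python
# Hypothetical check that the label survived the text processing step.
import pandas as pd

word_vectors = pd.DataFrame({"great": [1, 0], "luggage": [0, 1]})   # word-vector output
labels = pd.Series(["pos", "neg"], name="sentiment")                # original label column

# If an upstream step dropped the label, re-attach it in the original
# row order before training the model.
if "sentiment" not in word_vectors.columns:
    word_vectors = word_vectors.assign(sentiment=labels.values)
print(word_vectors)
```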
@aengler
Thanks a lot for the explanation. Yes, I had followed the same process. And every time, my results for test and training (SVM or NBC) come out almost the same.
I was a bit unsure whether that is ideal, that's why I asked.
thanks much,
Heikoe