Another query on document classification and assigning weights to keywords
S_
New Altair Community Member
Hi,
Thanks for the response earlier.
I have a couple more questions on document classification, although they are unrelated to what I asked last time around.
+ I am developing a Naive Bayes model on historical data (labeled with 4 categories) to classify documents. I have a pretty skewed sample (2 of the categories dominate). Is it important to have the data balanced (i.e. 25% per category)? I ask this because the accuracy of my model is only 70%, even though I feel that it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.
+ Based on your experience, can you tell me how important filtering stopwords is to building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build a dictionary of my own to filter out additional stopwords.
+ How can I assign weights to certain keywords in RapidMiner? I think this will help me to improve the accuracy of the model.
+ As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?
Thanks.
Regards,
Sharath
Answers
-
Hi again
+ I am developing a Naive Bayes model on historical data (labeled with 4 categories) to classify documents. I have a pretty skewed sample (2 of the categories dominate). Is it important to have the data balanced (i.e. 25% per category)? I ask this because the accuracy of my model is only 70%, even though I feel that it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.
Kind of. First, the model has a prior: in case it does not know anything, it might simply predict the most frequent class. For that reason I would balance the data.
Furthermore, accuracy as a measure is highly dependent on class balance. If you have unbalanced data, accuracy becomes hard to interpret.
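A minimal sketch of why accuracy is hard to interpret on skewed data. The label counts are made up for illustration, and the categories "HR" and "Legal" are hypothetical (only Operations and Finance are named in this thread). The "model" here ignores the two rare classes entirely, yet still reaches 80% accuracy; averaging the per-class recall exposes the problem.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each class: correctly predicted examples / all examples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical skewed sample: two of the four categories dominate.
y_true = ["Operations"] * 40 + ["Finance"] * 40 + ["HR"] * 10 + ["Legal"] * 10
# A model that only ever predicts the two dominant classes.
y_pred = ["Operations"] * 40 + ["Finance"] * 40 + ["Operations"] * 10 + ["Finance"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Looking at the per-class recall (or a confusion matrix) next to the overall accuracy is one way to see whether the minority categories are being learned at all.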
+ Based on your experience, can you tell me how important filtering stopwords is to building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build a dictionary of my own to filter out additional stopwords.
I personally think that it is not that important, because most stop words are thrown out by TF-IDF or feature selection anyway.
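To illustrate why TF-IDF already suppresses most stop words, here is a small sketch using one common textbook variant of the inverse document frequency, idf(t) = log(N / df(t)). The example sentences are invented, and the exact formula used by any particular tool may differ.

```python
import math

def inverse_document_frequency(documents):
    """Return idf(term) = log(N / df(term)) for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so that a term is counted once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical mini corpus
docs = [
    "the invoice for the payment is attached",
    "the server outage is resolved",
    "the payment for the order failed",
    "the shipment is delayed",
]
idf = inverse_document_frequency(docs)

# "the" occurs in every document, so its idf (and thus its TF-IDF weight) is zero.
print(idf["the"])                    # 0.0
# Rarer, more descriptive terms get a larger idf.
print(idf["invoice"] > idf["is"])    # True
```

A term that appears in nearly every document ends up with a weight near zero regardless of whether it is on a stopword list, which is why an extra hand-built dictionary often changes little.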
+ As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?
So you would simply count? Yes, it is. I built a process like this somewhere here in the forum.
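The counting idea can be sketched in a few lines. This is not the forum process mentioned above, just an illustration; the function name, the keyword lists, and the "unknown" fallback are all hypothetical.

```python
import re

def classify_by_keywords(text, keywords_by_category, default="unknown"):
    """Return the category whose keywords occur most often in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    scores = {
        category: sum(tokens.count(keyword) for keyword in keywords)
        for category, keywords in keywords_by_category.items()
    }
    # On a tie, max() returns the category that comes first in the dictionary.
    best = max(scores, key=scores.get)
    # No keyword matched at all: fall back instead of guessing a category.
    return best if scores[best] > 0 else default

# Hypothetical keyword lists, e.g. as they might be read from an input file
keywords_by_category = {
    "Finance": ["invoice", "payment", "budget"],
    "Operations": ["shipment", "outage", "schedule"],
}

print(classify_by_keywords("Payment reminder: invoice 17 is overdue", keywords_by_category))
# Finance
```

Such a rule-based approach needs no training data, but its quality depends entirely on how well the keyword lists are chosen.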
Btw: Have you tried a linear SVM?
-
Thanks a lot for the prompt response Michael! This really helps.
I think you missed my query on assigning weights. I would appreciate it if you could respond to that one as well.
Thanks.
Regards,
Sharath
-
Hi,
The answer is basically that you cannot add weights for attributes, only for examples. The reason is that most models choose their weights on their own. Think about a linear regression: there you do not want to change the coefficients (~weights) yourself.
The only thing you can do is duplicate attributes.
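A rough sketch of why duplicating an attribute acts like a weight, using a hand-written Naive Bayes style score (log prior plus the sum of log likelihoods). All probabilities and attribute names below are hypothetical, and this is not how any particular tool implements its learner.

```python
import math

def naive_bayes_log_score(prior, likelihoods):
    """log P(c) + sum of log P(x_i | c) over the attributes used."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

def predict(class_likelihoods, prior, attributes):
    """Pick the class with the highest log score for the given attribute list."""
    scores = {
        label: naive_bayes_log_score(prior, [table[a] for a in attributes])
        for label, table in class_likelihoods.items()
    }
    return max(scores, key=scores.get)

# Hypothetical values of P(attribute is present | class)
class_likelihoods = {
    "Finance": {"has_invoice": 0.8, "has_schedule": 0.3},
    "Operations": {"has_invoice": 0.3, "has_schedule": 0.9},
}

# Each attribute used once: 0.5*0.8*0.3 = 0.12 vs 0.5*0.3*0.9 = 0.135
print(predict(class_likelihoods, 0.5, ["has_invoice", "has_schedule"]))
# Operations

# "has_invoice" duplicated, so its term enters the score twice:
# 0.5*0.8*0.8*0.3 = 0.096 vs 0.5*0.3*0.3*0.9 = 0.0405
print(predict(class_likelihoods, 0.5, ["has_invoice", "has_invoice", "has_schedule"]))
# Finance
```

The duplicated attribute contributes its evidence twice, which is the only sense in which it is "weighted"; the learner itself is unchanged.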
Best,
Martin
-
In my case I have only two columns: 1. Subject + Content of an email 2. Email label (the category to which it belongs: Operations, Finance, etc.)
Could you please elaborate a bit on what you mean by adding weights to examples and not attributes, with reference to my case above?
Also, when you say duplicating attributes, do you mean duplicating certain mails (in my case) that are very descriptive and have a lot of keywords before building a model?
Thanks again.
Sharath
-
One more thing: when I say adding weights, I do not refer to the coefficients of a model but to something similar to oversampling and undersampling (i.e. giving more weight to certain records that are more descriptive than some of the others).
-
Oh, in that case:
Add another column with Generate Attributes and set its role to weight. Then all learners that can handle weights will use them.
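What goes into such a weight column is up to you. As one possible sketch, the weights below are chosen inversely proportional to the class frequency, so that every category carries the same total weight (similar in effect to over/undersampling). The function name, the label counts, and the "HR" and "Legal" categories are hypothetical; weights based on how descriptive a mail is would be computed differently.

```python
from collections import Counter

def balancing_weights(labels):
    """Weight each example by N / (k * n_c), where N is the number of examples,
    k the number of classes and n_c the size of the example's class."""
    counts = Counter(labels)
    n_examples, n_classes = len(labels), len(counts)
    return [n_examples / (n_classes * counts[label]) for label in labels]

# Hypothetical skewed sample: two of the four categories dominate.
labels = ["Operations"] * 40 + ["Finance"] * 40 + ["HR"] * 10 + ["Legal"] * 10
weights = balancing_weights(labels)

print(weights[0])    # 0.625 (example from a dominant class)
print(weights[-1])   # 2.5   (example from a rare class)

# Every class ends up with the same total weight.
totals = Counter()
for label, weight in zip(labels, weights):
    totals[label] += weight
print(dict(totals))
```

The resulting list corresponds to the values of the extra column; a learner that supports example weights would then treat one rare-class mail like several dominant-class mails.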