Another query on document classification and assigning weights to keywords
Hi,
Thanks for the response earlier.
I have a couple more questions on document classification, although they are unrelated to what I asked last time.
+ I am developing a Naive Bayes model on historical data (labeled with 4 categories) to classify documents. I have a pretty skewed sample (2 of the categories dominate). Is it important to have the data balanced (i.e. 25% per category)? I ask because the accuracy of my model is only 70%, even though I feel it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.
+ Based on your experience, how important is filtering stopwords to building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build my own dictionary to filter out additional stopwords.
+ How can I assign weights to certain keywords in RapidMiner? I think this would help me improve the accuracy of the model.
+ As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?
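To make the last question concrete, this is roughly what I have in mind, written as a small Python sketch rather than a RapidMiner process. The category names, the keyword lists, and the function name classify_by_keywords are made up for illustration; in practice the keywords would come from the input file. It simply counts keyword hits per category and picks the category with the most hits.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the real ones would be loaded from the input file.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "payment", "refund"},
    "technical": {"error", "crash", "install"},
}


def classify_by_keywords(text, category_keywords=CATEGORY_KEYWORDS, default="unknown"):
    """Return the category whose keywords occur most often in the text."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Score each category by the total number of keyword occurrences.
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in category_keywords.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(classify_by_keywords("The invoice shows a double payment"))
```

Ties go to whichever category appears first in the dictionary, and there is no weighting of individual keywords here, which is part of why I am asking whether this kind of approach is reasonable at all.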
Thanks.
Regards,
Sharath