A program to recognize and reward our most engaged community members
+ I am developing a Naive Bayes model on historical data (with label 4 categories) to classify documents. I have a pretty skewed sample (2 of the categories dominate). Is it important to have the data balanced (i.e 25%) ? I ask this because the accuracy of my model is only 70%, even though I feel that it should be around 80%-85% as the data I am analyzing is pretty descriptive and is of good quality.
Based on your experience, can you tell me how important filtering stopwords is essential to building a classification model. Currently, I have used used only the English stopwords. Maybe I would have to build a dictionary on my own to filter out additional stopwords based on your response.
As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?