Text mining classification with multiple classes
Hi,
I am relatively new to data science and therefore I have some questions:
I’m working on a text mining multi-class classification problem for a study assignment. The aim of the assignment is to build a model that predicts the ‘score’ attribute of textual product reviews. The possible ‘score’ values (classes) are 1, 2, 3, 4 or 5, so it is like a star rating. My dataset contains 6 features:
- ReviewerID, ReviewerName, ReviewText, Score, Summary and the length of my textual review.
- There are 5000 reviews (rows) in my dataset and a few missing values (in ReviewerName).
- 3000 reviews are 5-star reviews, 1000 are 4-star reviews and the rest are 1-, 2- or 3-star reviews, so the classes are imbalanced.
- I've uploaded the dataset
I have tried various classification methods (kNN, naïve Bayes and Logistic regression SVM), but I cannot seem to get my model’s accuracy above 62%. I don’t know whether this is a good accuracy or not; random guessing gives 20%, but I have the feeling that there are things I can do to make the model more accurate. If I try to rebalance the dataset, the accuracy drops to at most 40%.
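To put the 62% in context, here is a minimal sketch (in Python, not part of my RapidMiner process) of two reference points computed only from the counts listed above: uniform random guessing, and always predicting the most frequent class. The 1 to 3 star reviews are only known as a combined remainder, and the variable names are purely illustrative.

```python
# Class counts taken from the description above. The 1-3 star reviews are
# only known as a combined remainder (5000 - 3000 - 1000).
total_reviews = 5000
counts = {
    "5 stars": 3000,
    "4 stars": 1000,
    "1-3 stars (combined)": total_reviews - 3000 - 1000,
}

# Accuracy of a model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total_reviews

# Expected accuracy of guessing uniformly at random among the 5 score values.
uniform_baseline = 1 / 5

print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {majority_baseline:.0%}")
print(f"uniform random baseline accuracy: {uniform_baseline:.0%}")
```

With these counts the majority-class baseline comes out at 60%, so that is probably a fairer reference point for the 62% than the 20% from random guessing.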
The process is: Read CSV (using quotes) -> numerical to polynomial -> set role (‘score’ as label) -> nominal to text -> select attributes (ReviewerID is left out) -> split data (70%/30%) -> process documents (tokenize, stem, filter stop words, transform cases, generate n-grams (2)) -> cross validation (10-fold, kNN) -> performance
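For anyone not using the same tool, the "process documents" step can be sketched roughly as follows in plain Python. This is only an illustration under assumptions, not how RapidMiner implements it: the stop-word list is a tiny hypothetical one, tokens are taken to be runs of letters, bigrams are joined with "_", and stemming is left out because it would need an external stemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "to", "of"}


def tokenize(text):
    """Split text into runs of letters (everything else is a separator)."""
    return re.findall(r"[A-Za-z]+", text)


def filter_stop_words(tokens):
    """Drop stop words, comparing case-insensitively."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def transform_cases(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def generate_ngrams(tokens, max_length=2):
    """Return the tokens plus all n-grams up to max_length, joined by '_'."""
    result = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            result.append("_".join(tokens[i:i + n]))
    return result


def preprocess(text):
    """Tokenize, filter stop words, lowercase, then add bigrams (no stemming)."""
    tokens = tokenize(text)
    tokens = filter_stop_words(tokens)
    tokens = transform_cases(tokens)
    return generate_ngrams(tokens, max_length=2)


example = "This product is great and the battery lasts long"
print(preprocess(example))
```

Note that in this sketch the bigrams are built after the stop words are removed, so "great_battery" becomes a feature even though the two words are not adjacent in the original sentence.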
I don’t know whether I am missing steps in my process, whether I am making mistakes, or whether 62% accuracy is simply the maximum. I hope that someone can help me out or give me some tips!
Thanks!
Greetings, Marijn