Text mining classification with multiple classes

User: "marijn_nbr"
New Altair Community Member
Updated by Jocelyn

Hi,

 

I am relatively new to data science and therefore I have some questions:

 

I’m working on a multi-class text mining classification problem for a study assignment. The aim of the assignment is to build a model that predicts the ‘score’ attribute of textual product reviews. The possible ‘score’ values (classes) are 1, 2, 3, 4 or 5, so it is like a star rating. My dataset contains 6 features:

  • ReviewerID, ReviewerName, ReviewText, Score, Summary and the length of the review text.
  • There are 5000 reviews (rows) in my dataset and a few missing values (in ReviewerName).
    • 3000 reviews are 5-star reviews, 1000 are 4-star reviews and the remaining 1000 are 1, 2 or 3-star reviews, so the classes are imbalanced.
  • I've uploaded the dataset

I have tried various classification methods (kNN, naïve Bayes and Logistic regression SVM), but I cannot get the accuracy of my model above 62%. I don’t know whether this is a good accuracy or not. A random guess would give 20%, but I have the feeling that there are things I can do to make the model more accurate. If I try to rebalance the dataset, the accuracy drops to at most 40%.
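To put the 62% in context, here is a minimal Python sketch that compares it with two simple baselines computed from the class counts given above: guessing uniformly among the 5 classes, and always predicting the most frequent class. The grouping of the 1, 2 and 3 star reviews into one bucket is an assumption made only because their individual counts are not stated.

```python
from collections import Counter

# Class counts as described in the post. The split of the remaining
# 1000 reviews across 1, 2 and 3 stars is not given, so they are
# grouped into one bucket here (an assumption for illustration).
class_counts = Counter({"5": 3000, "4": 1000, "1-3": 1000})

total = sum(class_counts.values())

# Baseline 1: guess uniformly at random among the 5 possible scores.
uniform_guess = 1 / 5

# Baseline 2: always predict the most frequent class.
majority_class, majority_count = class_counts.most_common(1)[0]
majority_baseline = majority_count / total

print(f"uniform guess accuracy:     {uniform_guess:.0%}")
print(f"majority-class accuracy:    {majority_baseline:.0%}")
print(f"reported model accuracy:    {0.62:.0%}")
```

Under these counts, always predicting 5 stars would already be right for 3000 of the 5000 reviews, so accuracy alone may not be the most informative measure for this imbalanced dataset.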

 

The process is: Read CSV (using quotes) -> numerical to polynomial -> set role (‘score’ as label) -> nominal to text -> select attributes (ReviewerID is left out) -> split data (70%/30%) -> process documents (tokenize, stem, filter stop words, transform cases, generate n-grams (2)) -> cross validation (10-fold, with kNN) -> performance
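For readers who do not use the same tool, the text processing step of this process can be sketched roughly in plain Python. This is only an illustration, not the actual operators: the stop word list is a tiny hypothetical one, stemming is left out because it needs an external library, the text is lowercased before tokenizing for simplicity, and joining n-grams with an underscore is an assumption about the output format.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to"}

def process_document(text, max_n=2):
    """Turn one review text into a list of terms (unigrams and n-grams)."""
    # Transform cases: lowercase everything.
    lowered = text.lower()
    # Tokenize: keep runs of letters, drop everything else.
    tokens = re.findall(r"[a-z]+", lowered)
    # Filter stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Generate n-grams up to length max_n (2 in the process above),
    # keeping the single tokens as well.
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

print(process_document("This product is not good"))
# → ['product', 'not', 'good', 'product_not', 'not_good']
```

The bigram `not_good` shows why n-grams can help with review scores: the single tokens `not` and `good` on their own lose the negation.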

 

I don’t know whether I am missing steps in my process, making mistakes somewhere, or whether 62% accuracy is simply the maximum. I hope that someone can help me out or give me some tips!

 

Thanks!

 

Greetings, Marijn
