
Pruning a large set of features for text classification

User: "In777"

I am working on a binary text classification task. I've applied several preprocessing steps to my training data (stopword removal, stemming, morphological normalization, lowercasing, n-gram creation, etc.) and created a TF-IDF vector. I pruned the rare n-grams (below 5%) and was left with 18,000 n-grams. The choice of cutoff is arbitrary, and that bothers me. Then I applied a linear C-SVM (LibSVM). Unfortunately, the accuracy of my model on the test set is very low.

I think I have too many features left and want to reduce their number, so I decided to use information gain to keep only the most informative words. I placed the "Weight by Information Gain" operator followed by "Select by Weights" after the "Process Documents" operator, and at the end I used cross-validation with the linear SVM inside it. But I got an error that the sample does not include the meta data. I am not sure what I am doing wrong or how to fix it.
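To make the intended setup concrete, here is a rough sketch of the same pipeline written with scikit-learn instead of RapidMiner. The corpus and labels are made-up placeholders, and `mutual_info_classif` is my stand-in for the information gain weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus and labels; my real data is 18,000 TF-IDF n-grams.
docs = [
    "cheap meds buy now", "meeting agenda attached",
    "win a free prize today", "quarterly report draft",
    "free offer limited time", "project status update",
]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    # Uni- and bigram TF-IDF; min_df prunes rare n-grams by document
    # count instead of a percentage cutoff.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    # Keep the k features with the highest mutual information with the
    # label (my stand-in for "Weight by Information Gain" followed by
    # "Select by Weights").
    ("select", SelectKBest(mutual_info_classif, k=10)),
    # Linear SVM, analogous to a linear C-SVM in LibSVM.
    ("svm", LinearSVC(C=1.0)),
])

# Because selection sits inside the pipeline, it is refit on every
# training fold, so nothing from the test folds leaks into it.
scores = cross_val_score(pipe, docs, labels, cv=3)
print("accuracy per fold:", scores)
```

If I understand correctly, the same logic should apply in RapidMiner: the weighting and selection steps belong inside the cross-validation's training subprocess, not before it.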

Besides, what is the best way to prune a large set of features down to a manageable set of the most discriminative ones, and how can I implement that in RapidMiner? How else can I improve the performance of my model?
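For comparison, this is how I would treat the cutoff as something to tune rather than fix by hand, again sketched with scikit-learn. Here `chi2` is just one example of a discriminative score, and the corpus and parameter values are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same placeholder corpus as above.
docs = [
    "cheap meds buy now", "meeting agenda attached",
    "win a free prize today", "quarterly report draft",
    "free offer limited time", "project status update",
]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # chi-squared works on non-negative TF-IDF values and scores how
    # strongly each n-gram separates the two classes.
    ("select", SelectKBest(chi2)),
    ("svm", LinearSVC()),
])

# Treat the number of kept features (and C) as hyperparameters and let
# cross-validation pick them; on real data I would sweep values like
# 500-5000 instead of these toy counts.
grid = GridSearchCV(
    pipe,
    param_grid={"select__k": [5, 10, 15], "svm__C": [0.1, 1.0, 10.0]},
    cv=3,
)
grid.fit(docs, labels)
print(grid.best_params_, grid.best_score_)
```

The point of the grid search is that the feature count stops being an arbitrary choice like my 5% prune and becomes something the validation accuracy selects.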

