"Text Classification into many categories, wrong approach?"
biancanevo
New Altair Community Member
Hi all,
I am new to the RapidMiner community.
I recently discovered this software and I would like to thank the developers for providing us with such an amazing tool.
I am using RapidMiner for text classification.
I am a bit puzzled about the right approach to the problem.
I have a set of data like:
DESCRIPTION - LABEL
"problems with phone" - "CBCM ISSUE"
"0506565665 did not recharge" - "Data Issue"
and so on.
The problem is that I have 100+ different categories and 25,000+ records to use.
Basically, I divided the set into a training set (20,000 records) and 5,000 records that I try to classify.
The data is extremely noisy; the same input is often classified under different categories.
I use the typical approach to text classification (lowercasing, tokenization, replacing numbers with constant words in order to reduce the term space, stop word removal, stemming). The resulting vectors are used to train a k-NN classifier (I also tried an SVM with polynominal-to-binominal conversion), and I apply the classification model to the remaining data, which I prepared using the same procedure described before (tokenization, replacing numbers with constant words, stop word removal, stemming). Both sets of term vectors are weighted using Term Frequency.
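The preprocessing chain described above can also be sketched outside RapidMiner. Here is a minimal pure-Python version of the tokenization, number-replacement, and stop word removal steps (the `NUM` placeholder token and the tiny stop word set are illustrative assumptions, not RapidMiner's actual defaults):

```python
import re

def preprocess(text, stopwords):
    """Lowercase, tokenize, map every all-digit token to a shared
    placeholder (shrinking the term space), and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = ["NUM" if t.isdigit() else t for t in tokens]
    return [t for t in tokens if t not in stopwords]

print(preprocess("0506565665 did not recharge", {"did", "not"}))
# → ['NUM', 'recharge']
```

Mapping all phone numbers to one token means the classifier sees a single shared feature instead of thousands of unique, once-seen numbers.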
However, I do not get over 26-30% accuracy. I assume that I am taking the wrong approach to the problem.
I am trying to classify too many categories, and I believe that this way I will not get past this level of accuracy.
So I need to try a different approach.
I was thinking of training one model for each category, then applying all the models to the data to classify and selecting the best result.
I don't know if this approach makes sense to you, or if you can suggest something different.
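The one-model-per-category idea is essentially one-vs-rest classification: train a binary scorer per class and pick the class whose model is most confident. A hypothetical pure-Python sketch of the selection step (the lambda scorers here are toy stand-ins for real trained per-class models):

```python
def predict_best(scorers, example):
    """Apply every per-class binary scorer to the example and return
    the label of the most confident one (one-vs-rest selection)."""
    return max(scorers, key=lambda label: scorers[label](example))

# Toy stand-in scorers: each returns a fake confidence for its class.
scorers = {
    "CBCM ISSUE": lambda text: 0.9 if "phone" in text else 0.1,
    "Data Issue": lambda text: 0.8 if "recharge" in text else 0.2,
}
print(predict_best(scorers, "problems with phone"))  # → CBCM ISSUE
```

For this to work in practice, the per-class confidence scores must be comparable across models, which is one reason calibrated probabilities (or a single multinomial model) are often preferred.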
Regards!
Answers
Hi,
first of all: thanks for your kind words. We always really appreciate it when people like working with our product!
Before I start, I must admit that requests like yours leave the scope of this forum a bit and resemble consultancy more than technical support. However, please find below some comments and hints which might help. But don't expect too much: it is hardly possible to condense years of experience into a few lines without having seen the data.
- First, a short comment: if the classes are more or less equally distributed, 30% might not be that bad (at least it's 30 times "better" than just guessing), taking into account that the data seems to be noisy, which might be because even humans are not much better at this task.
- It's often more about preprocessing than about learning. However, just try one or two different SVMs with a linear kernel (often the best choice for text classification) and vary the important parameters, to be sure that you don't give away too much through poorly selected or tuned learning schemes.
- Make sure that exactly the same term space is used for modeling and scoring.
- Try with and without stop word removal and stemming; also try n-grams. The latter might be important for texts of lower quality.
- You only have about 200 texts for each class, which is not really much. Are the texts at least equally distributed? Try over-sampling rare classes with text-windowing approaches if necessary.
- Try different vectorization schemes, especially TF-IDF instead of mere term frequency.
- Make sure that you have used appropriate distance measures for text classification where applicable (e.g. for k-NN).
- Sometimes grouping classes first, maybe even into a hierarchy of classes (if possible), delivers much better results.
- Especially if the data is extremely noisy and the amount of data is limited, you should consider postprocessing, like returning multiple predictions based on the confidence scores or otherwise handling the uncertainty.
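As an illustration of the TF-IDF suggestion above, here is a minimal pure-Python sketch using the classic unsmoothed formula tf(t, d) · log(N / df(t)); RapidMiner's exact weighting variant may differ slightly:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by term frequency times inverse document
    frequency: a term that appears in every document gets weight 0."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["problems", "with", "phone"], ["phone", "NUM", "recharge"]]
weights = tfidf(docs)
# "phone" occurs in both documents, so its weight is 0 in each;
# the document-specific terms get weight log(2).
```

Compared to plain term frequency, this down-weights terms that occur in almost every ticket and therefore carry little class information.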
Cheers,
Ingo
Why did you replace numbers with constant words in order to reduce the term space?