"Text Classification into many categories, wrong approach?"

New Altair Community Member
Updated by Jocelyn
Hi all,
I am new to the RapidMiner community.
I recently discovered this software, and I would like to thank the developers for providing us with such an amazing tool.
I am using RapidMiner for text classification.
I am a bit puzzled about the right approach to the problem.
I have a set of data like:
DESCRIPTION - LABEL
"problems with phone" - "CBCM ISSUE"
"0506565665 did not recharge" - "Data Issue"
and so on.
The problem is that I have 100+ different categories and 25,000+ records to work with.
Basically, I divided the set into a training set (20,000 records) and a remaining 5,000 records that I try to classify.
The data is extremely noisy; the same input often appears under several different categories.
I use the typical approach to text classification (lowercasing, tokenization, replacing numbers with a constant token to reduce the term space, stop-word removal, stemming). The resulting vectors are used to train a k-NN classifier (I also tried a polynominal-to-binominal conversion with SVM), and I apply the classification model to the remaining data, which I prepared with the same procedure described above. Both sets of term vectors are weighted using Term Frequency.
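Outside RapidMiner, the pipeline described above could be sketched roughly with scikit-learn (this is only an illustration, not the actual RapidMiner process; the example documents, labels, and the `replace_numbers` helper are made up, and stemming is omitted for brevity):

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def replace_numbers(text: str) -> str:
    """Lowercase and replace every digit run with a constant token,
    shrinking the term space."""
    return re.sub(r"\d+", "NUM", text.lower())

# Plain Term Frequency weighting (use_idf=False), English stop-word removal.
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=replace_numbers,
                    stop_words="english",
                    use_idf=False),
    KNeighborsClassifier(n_neighbors=1),
)

# Toy records in the same shape as the data above (made-up examples).
docs = ["problems with phone", "phone is broken",
        "0506565665 did not recharge", "recharge of 123 failed"]
labels = ["CBCM ISSUE", "CBCM ISSUE", "Data Issue", "Data Issue"]

pipeline.fit(docs, labels)
print(pipeline.predict(["9999 did not recharge"]))  # → ['Data Issue']
```

With real data, `n_neighbors` would of course be larger; the point is only to show the preprocessing, TF weighting, and k-NN stages chained together.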
However, I do not get over 26-30% accuracy. I assume I am taking the wrong approach to the problem.
I am trying to classify too many categories, and I believe that this way I will not get past this level of accuracy.
So I need to try a different approach.
I was thinking of training one model for each category, then applying all the models to the data to be classified and selecting the best result.
I don't know if this approach makes sense to you, or if you can suggest something different.
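That idea corresponds to a one-vs-rest setup: one binary model per category, with the highest-scoring model winning at prediction time. A minimal sketch in scikit-learn (toy documents and labels are assumptions, not real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One binary SVM is trained per category; at prediction time every model
# scores the document and the category with the highest decision score wins.
clf = make_pipeline(
    TfidfVectorizer(use_idf=False),
    OneVsRestClassifier(LinearSVC()),
)

docs = ["problems with phone", "phone is broken",
        "card did not recharge", "recharge failed again"]
labels = ["CBCM ISSUE", "CBCM ISSUE", "Data Issue", "Data Issue"]

clf.fit(docs, labels)
print(clf.predict(["my phone is broken"]))  # → ['CBCM ISSUE']
```

This is essentially what the polynominal-to-binominal conversion with SVM mentioned above does internally, so on its own it may not change the accuracy much; it mainly makes the per-category decisions explicit.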
Regards!