"Increasing text categorization performance through dedicated wordlists"

Question

I have been playing with text categorization over the last few months, and I now have a question for which I could not find an answer here on the forums or anywhere else.

My text categorization models have an accuracy of around 62% (SVM, with SVD for dimensionality reduction).

I want to try to improve this by "helping" the learner a little bit. For a category 'Product related' I know all possible products (something RapidMiner - of course - does not know). Another example would be a list of swear words for tagging cases with a category 'Flame'.
Is it possible to help the learner by connecting or relating wordlists to certain categories?

Thanks for your help!

MariusHelf · Answer

Well, some models, e.g. k-NN, give more weight to attributes with higher values. For most models that is not true, though, and for the SVM or Naive Bayes there is no means of providing additional information of that kind to the model creation process.

What you could do, however, is generate a new attribute that contains the result of the "classification" by keywords, as described in my post above, and use that attribute in addition to the normal word vector when creating the SVM model.
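
Outside RapidMiner, the same idea can be sketched in a few lines of Python. This is only an illustration, not the RapidMiner process itself: the wordlists, the function names and the choice of a hit count per wordlist are all assumptions made for the example.

```python
import re

# Hypothetical wordlists; in practice they would come from your own
# product list and swear-word list.
WORDLISTS = {
    "product": {"widgetpro", "gizmo"},
    "flame": {"idiot", "stupid"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_features(text):
    """Return one hit count per wordlist, ordered by wordlist name."""
    tokens = tokenize(text)
    return [
        sum(1 for token in tokens if token in WORDLISTS[name])
        for name in sorted(WORDLISTS)
    ]


def augment(word_vector, text):
    """Append the keyword attributes to an existing word vector."""
    return list(word_vector) + keyword_features(text)
```

The augmented vector is what the learner would be trained on, so the SVM can decide for itself how much weight the wordlist hits deserve.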

Best regards,
Marius

nennat · Answer

Hi Marius,

Thank you, I had not thought about that approach. But does that mean that it is not possible to help the model by giving it a list of words with a strong relation to a certain label?
With manual label assignment, I think I will run into problems with cases that contain specific words from multiple labels.
How would I deal with this?

MariusHelf · Answer

Hi,

If you really want to create rules, you could use Process Documents with binary term occurrences, then use Generate Attributes and Filter Examples to assign labels manually, and apply the model only to the remaining documents that are not covered by the manual rules.
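
As a rough Python sketch of that rules-first flow (not RapidMiner operators): the rule table, the function names and the `model_predict` callable are hypothetical. Sending documents that match the wordlists of more than one label on to the model is an assumption of this example, chosen as one possible way to handle the overlap you mention.

```python
import re

# Hypothetical manual rules: label -> words that trigger it.
RULES = {
    "Product related": {"gizmo", "widgetpro"},
    "Flame": {"idiot", "stupid"},
}


def binary_terms(text):
    """Binary term occurrences: the set of distinct tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rule_label(text):
    """Return a label if exactly one rule matches, otherwise None.

    Assumption: documents that match several labels are treated as
    not covered by the rules and are left to the model.
    """
    terms = binary_terms(text)
    matched = [label for label, words in RULES.items() if terms & words]
    return matched[0] if len(matched) == 1 else None


def classify(texts, model_predict):
    """Label by rule where possible; use the model for the remaining texts."""
    results = []
    for text in texts:
        label = rule_label(text)
        results.append(label if label is not None else model_predict(text))
    return results
```

Here `model_predict` stands in for whatever trained model you apply to the uncovered documents, e.g. the SVM.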

Best regards,
Marius