How to set up a model to categorize texts

gstar · January 2014

Hi folks, being a relative newbie to RapidMiner, I would like to achieve the following task:

To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.

*the documents are relatively short and contain between 50 and 200 words
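
For readers who want to see the three steps end to end, here is a minimal sketch in plain Python, outside RapidMiner. It assumes a multinomial Naive Bayes scorer as the model, which is only one possible choice; the function names and the toy recipes are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Steps 1 and 2: count words per category and keep the counts as the model."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of documents
    for text, category in labeled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Step 3: score an unknown text against each category (multinomial Naive Bayes)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace smoothing so a missing word does not zero out a category
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Made-up toy data, for illustration only
training = [
    ("brown the beef then simmer the beef with onions", "beef"),
    ("roast the beef and rest the meat before slicing", "beef"),
    ("steam the carrots and broccoli until tender", "vegetables"),
    ("toss the vegetables with olive oil and roast the carrots", "vegetables"),
]
model = train(training)
print(model[0]["beef"].most_common(3))   # step 1: most common words in a category
print(classify("simmer beef bones for a rich beef stock", model))
```

With real data, stop words such as "the" would normally be filtered out before counting, otherwise they dominate the most-common-word lists.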

So far the text mining step works quite well.
Choosing the right model seems challenging.
A decision tree produces a plausible model. However, the branches do not expose a yes/no test (word exists / does not exist). Instead I am only shown statistics for decision making, which I cannot use for step 3. :-[
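
To make the yes/no branches concrete: if each document is encoded as binary word occurrences instead of frequencies or TF-IDF weights, a split on a word reads directly as "word present / word absent". The sketch below is plain Python, not RapidMiner; the vocabulary and the hand-written rule are hypothetical. In RapidMiner, the binary term occurrences setting for vector creation should give a comparable encoding, though that is an assumption about this particular process.

```python
import re


def binary_occurrences(text, vocabulary):
    """Map each vocabulary word to True/False depending on whether it appears in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {word: word in tokens for word in vocabulary}


def toy_tree(features):
    """A hand-written y/n rule of the kind a tree over binary features could produce."""
    if features["beef"]:
        return "beef"
    if features["carrots"]:
        return "vegetables"
    return "unknown"


# Hypothetical vocabulary, for illustration only
vocabulary = ["beef", "carrots", "stock"]
features = binary_occurrences("Simmer the beef bones to make stock", vocabulary)
print(features)
print(toy_tree(features))
```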

Thanks for any input!
Gstar

MariusHelf · January 2014

Hi Gstar,

For text mining, Naive Bayes or a linear SVM usually does a good job.
Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). Usually a range between 1e-4 and 1 on a logarithmic scale is a good starting point. Expand the range if the detected optimum is near the limits of the range.
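
As a rough illustration of that search range, the sketch below generates candidate C values on a logarithmic scale and widens the range when the best value lands on an edge. It is plain Python, not RapidMiner; the helper names, the five steps, and the widening factor of 100 are assumptions chosen for the example.

```python
import math


def log_grid(low, high, steps):
    """Return `steps` values spaced evenly on a log10 scale from low to high."""
    lo_exp, hi_exp = math.log10(low), math.log10(high)
    return [10 ** (lo_exp + i * (hi_exp - lo_exp) / (steps - 1)) for i in range(steps)]


def expand_if_at_edge(grid, best, factor=100.0):
    """Widen the range by `factor` on whichever side the best value sits on."""
    low, high = grid[0], grid[-1]
    if math.isclose(best, low):
        low /= factor
    if math.isclose(best, high):
        high *= factor
    return low, high


grid = log_grid(1e-4, 1.0, 5)            # candidate C values from 1e-4 to 1
print(grid)
# Suppose the best C turned out to be the largest candidate: extend upwards
print(expand_if_at_edge(grid, best=grid[-1]))
```

A linear grid over the same range would put nearly all candidates between 0.1 and 1, which is why the logarithmic scale matters here.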

Best regards,
Marius

gstar · January 2014

Great. Thanks! I'll try it and report back later!

gstar · January 2014

Working with 5 categories, so far I got the best results with a k-NN model using overlap similarity and k=5.
Naive Bayes performs worse.
I cannot get the linear SVM to work, since it does not support polynominal labels (i.e. 5 different labels in my case).

Is there a workaround?

MariusHelf · January 2014

The operator Polynominal by Binominal Classification is your friend in this case.
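
The idea behind such a wrapper can be sketched in plain Python. The example assumes a one-vs-rest decomposition (one binary model per class, that class against all others) and uses a tiny perceptron as a stand-in for the linear SVM; both are illustrative choices, not a description of how the RapidMiner operator is configured, and the toy data is made up.

```python
from collections import Counter


def train_binary(docs, labels, epochs=10):
    """Train a tiny perceptron on bags of words; labels are +1 or -1."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for words, y in zip(docs, labels):
            score = bias + sum(weights[w] for w in words)
            if y * score <= 0:           # misclassified: nudge towards y
                for w in words:
                    weights[w] += y
                bias += y
    return weights, bias


def train_one_vs_rest(docs, labels):
    """Train one binary model per class: that class (+1) against all others (-1)."""
    return {
        cls: train_binary(docs, [1 if lab == cls else -1 for lab in labels])
        for cls in sorted(set(labels))
    }


def predict(models, words):
    """Pick the class whose binary model gives the highest score."""
    def score(cls):
        weights, bias = models[cls]
        return bias + sum(weights[w] for w in words)
    return max(models, key=score)


# Made-up toy data with three classes, for illustration only
docs = [["beef", "stew"], ["beef", "roast"], ["carrot", "salad"],
        ["carrot", "soup"], ["apple", "pie"], ["apple", "cake"]]
labels = ["beef", "beef", "vegetables", "vegetables", "dessert", "dessert"]

models = train_one_vs_rest(docs, labels)
print(predict(models, ["beef", "stock"]))
```

With 5 categories this trains 5 binary models, so the binary-only learner never sees more than two labels at a time.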

Best regards,
Marius
