How to set up a model to categorize texts

gstar New Altair Community Member
edited November 5 in Community Q&A
Hi folks, being a relative newbie to RapidMiner, I would like to achieve the following task:

To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.

*the documents are relatively short and contain between 50 and 200 words
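The three steps above can be sketched outside RapidMiner, for instance with scikit-learn: vectorize the documents by word frequency, train a classifier on the labeled vectors, then classify an unseen text. All recipe texts and category names below are invented for illustration.

```python
# Minimal stand-in for the three-step process, using scikit-learn
# instead of RapidMiner (all texts and labels are made up):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1) tokenize/vectorize the training recipes (word statistics)
train_texts = [
    "braise the beef with onions and red wine",
    "roast the beef and rest before carving",
    "steam the vegetables and toss with butter",
    "grill the vegetables with olive oil and salt",
]
train_labels = ["beef", "beef", "vegetables", "vegetables"]

# 2) feed the word vectors into a model to learn the categories
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# 3) classify an unseen text against the learned model
print(model.predict(["simmer beef bones for a rich stock"])[0])
```

The same idea applies regardless of the learner; the vectorizer and classifier are interchangeable parts of the pipeline.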

So far I have accomplished the text-mining process quite well.
Choosing the right model seems challenging.
A decision tree comes up with a plausible model. However, the branches do not expose y/n (word exists / does not exist). Instead I am just presented with statistics for decision making that I cannot use for step 3.  :-[

Thanks for any input!
Gstar

Answers

  • MariusHelf New Altair Community Member
    Hi Gstar,

    For text mining, Naive Bayes or a linear SVM usually does a good job.
    Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). Usually a range between 1e-4 and 1 on a logarithmic scale is a good starting point. Expand the range if the detected optimum is near the limits of the range.

    Best regards,
    Marius
  • gstar New Altair Community Member
    Great. Thanks! I'll try it and report back later!
  • gstar New Altair Community Member
    Working with 5 categories, so far I have gotten the best results with a k-NN model using overlap similarity and k=5.
    Naive Bayes performs worse.
    I cannot get the SVM (linear) to work, since it does not support polynominal labels (i.e. 5 different labels in my case).

    Is there a workaround?
  • MariusHelf New Altair Community Member
    The operator Polynominal by Binominal Classification is your friend in this case :)

    Best regards,
    Marius
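The two suggestions from this thread can be combined in one sketch, again using scikit-learn as a stand-in for RapidMiner: wrap a linear SVM one-vs-rest so it handles a polynominal (multi-class) label, and tune C on a logarithmic grid between 1e-4 and 1. All recipe texts, labels, and parameter values below are invented for illustration.

```python
# One-vs-rest linear SVM with C tuned on a log-scale grid
# (stand-in for Polynominal by Binominal Classification plus
# Optimize Parameters (Grid); all data below is made up):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "braise beef shank with red wine",
    "sear beef steak in a hot pan",
    "steam broccoli and carrots lightly",
    "roast mixed vegetables with herbs",
    "bake chicken thighs with lemon",
    "poach chicken breast in broth",
]
labels = ["beef", "beef", "vegetables", "vegetables", "chicken", "chicken"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # one binary SVM per category, so the multi-class label works
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# log-spaced C grid from 1e-4 to 1, as suggested in the thread;
# widen the range if the optimum lands on an edge
grid = GridSearchCV(
    pipe,
    {"clf__estimator__C": np.logspace(-4, 0, 5)},
    cv=2,  # tiny toy corpus; use more folds with real data
)
grid.fit(texts, labels)
print(grid.best_params_)
print(grid.predict(["sear beef shank in a hot pan"])[0])
```

If the best C sits at 1e-4 or 1, extend the grid in that direction and search again, as Marius advises.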