How to set up model to categorize texts
gstar
New Altair Community Member
Hi folks, beeing a relative new bee to rapid miner, I would like to achieve the following task:
To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.
*the documents are relatively short and contain between 50 and 200 words
So far I accomplished the text mining process quite well.
Choosing the right model seems challenging.
A decision tree model comes up with a plausible model. However, the the branches do not expose y/n (word exists / does not exist). Instead I am just presented statistics for decision making that I can not use for step 3. :-[
Thanks for any input!
Gstar
To set up a process that
1) does text mining* to find out the most common words within a category of text (e.g. recipes for beef, vegetables, etc.)
2) feeds the different results for each category into a model to teach the model the text category
3) takes an unknown text (e.g. a recipe for beef stock) and compares it to the model to find out the corresponding category.
*the documents are relatively short and contain between 50 and 200 words
So far I accomplished the text mining process quite well.
Choosing the right model seems challenging.
A decision tree model comes up with a plausible model. However, the the branches do not expose y/n (word exists / does not exist). Instead I am just presented statistics for decision making that I can not use for step 3. :-[
Thanks for any input!
Gstar
Tagged:
0
Answers
-
Hi Gstar,
for text mining Naive Bayes or a linear SVM usually do a good job.
Don't forget to optimize the C parameter of the SVM using Optimize Parameters (Grid). Usually a range between 1e-4 and 1 on a logarithmic scale is a good starting point. Expand the range if the detected optimum is near the limits of the range.
Best regards,
Marius0 -
Great. Tanks! I'll try it and report back later!0
-
Working with 5 categories, so far i got the best results with a k-nn model using overlap similarities and k=5.
Naive bayes performs worse.
I cannot get SVM (linear) to work, since it does not support polynominal labels (i.e. 5 different labels in my case).
Is there a workaround?0 -
The operator Polynominal by Binominal classification is your friend in this case
Best regards,
Marius0