"Text Mining Classification Problem"
bx01z
New Altair Community Member
Hello,
I am able to create a model using RM5, but I do not believe the algorithm I chose is working well. I have tried a number of algorithms, including SVM, NaiveBayes, and W-SMO.
For each document, I Tokenize, Filter Stopwords (English), and then Filter Tokens (by Length); the result is sent to the classification algorithm.
I then process unlabeled data and apply the model to it, but it classifies every document as the same value.
I have 4 classes, with 500 labeled examples each for training.
Please provide guidance.
Thanks,
Bob
Answers
Hi Bob,
Phew, this question definitely reaches the limit of the support we are able to provide for free in this forum, sorry. Questions like this are exactly what we work on in consulting projects for our customers, and they often need much more time than a few minutes of thinking and writing an answer down in a forum.
However, here are some hints for optimizing:
- You could further try different preprocessing techniques like stemming, or character or term n-grams
- If the texts are derived from specific domains, sometimes a dictionary for mapping terms can also help
- You could try to use pruning or other (mild!) feature selection techniques
- Try different modeling schemes and optimize their parameters
- ...
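To make the preprocessing hints a bit more concrete, here is a minimal sketch in plain Python of tokenizing, stopword and length filtering, term n-grams, and pruning by document frequency. It is not RapidMiner code and does not reflect how the RapidMiner operators are implemented; the stopword list, the function names, and the thresholds are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}

def tokenize(text, min_len=3, max_len=25):
    """Lowercase, split on non-letters, drop stopwords and tokens outside the length range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and min_len <= len(t) <= max_len]

def add_term_ngrams(tokens, n=2):
    """Append term n-grams (joined with '_') to the unigram tokens."""
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ngrams

def prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most max_df_ratio of all documents."""
    df = Counter(term for doc in docs for term in set(doc))
    limit = max_df_ratio * len(docs)
    keep = {t for t, c in df.items() if min_df <= c <= limit}
    return [[t for t in doc if t in keep] for doc in docs]

# Hypothetical example documents.
texts = [
    "The engine of the car is loud",
    "Car engine repair is expensive",
    "The recipe needs an onion",
]
docs = [add_term_ngrams(tokenize(t)) for t in texts]
pruned = prune_by_document_frequency(docs)
```

On this toy input the third document ends up with no terms at all after pruning, which is one reason to keep pruning mild: documents that lose all of their features look identical to the learner.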
Cheers,
Ingo
Hi Ingo,
Thanks for the reply. Thanks for the suggestions. It's the stuff I've been trying, but I will press on. Perhaps I can ask a few simpler questions about how RM handles things.
1. When doing the data processing, is the label retained for the resulting dataset for each of the terms individually?
2. Is there a place to view accuracy levels of created models applied to the data used to create them?
3. Does RM use LingPipe at all?
Thanks again,
Bob
Hi,
ad 1)
Sorry, I didn't understand this.
ad 2)
Again I am not sure if I got you. However, maybe you mean something like the result history in the result perspective which can show the latest results (just click on the colored bars to open the single results).
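If the question is about how well a model reproduces the labels of the data it was trained on, the quantities involved can be sketched in plain Python as below. This is only an illustration under that assumption, not a RapidMiner feature; the function names and the label lists are hypothetical.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Hypothetical case: four balanced classes, and a model that predicts "a" for everything.
true_labels = ["a", "b", "c", "d", "a", "b", "c", "d"]
predicted_labels = ["a"] * len(true_labels)

training_accuracy = accuracy(true_labels, predicted_labels)
prediction_distribution = Counter(predicted_labels)
confusion = confusion_counts(true_labels, predicted_labels)
```

With four balanced classes, a model that always predicts the same class scores 25% accuracy, so the distribution of predicted labels is worth checking alongside the accuracy figure.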
ad 3)
At least here I can be clear: LingPipe cannot be supported in the free community edition of RapidMiner due to license issues. Although there is a royalty-free license for LingPipe, it is not compatible with 100% open source licenses and strategies like RapidMiner's, sorry. Of course, it would be possible to build a custom connector to LingPipe within a customer project.
Cheers,
Ingo