Hi everyone!
I have a set of documents, each with the probabilities of its belonging to specific classes.
E.g.:
text C1 C2 C3 C4 C5 C6 C7
bla bla... 10% 20% 60% 80% 0% 5% 30%
I want to train a model which can predict these probabilities from a given text.
As you can see, the classes are not mutually exclusive; each document only has a probability of belonging to each class. Note also that these probabilities do not add up to 100%!
To get familiar with RapidMiner, I preprocessed the documents (tokenize, filter, ...) and gave them (mutually exclusive) labels.
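To make the target format concrete, here is a minimal sketch in plain Python (not RapidMiner). The names `CLASSES` and `parse_row` are made up for illustration; only the example row is taken from the table above.

```python
# Hypothetical helper names; only the example row comes from the post.
CLASSES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]

def parse_row(text, percentages):
    """Turn one row of percentages into a target vector with values in [0, 1]."""
    if len(percentages) != len(CLASSES):
        raise ValueError("expected one percentage per class")
    targets = [p / 100.0 for p in percentages]
    return text, targets

text, targets = parse_row("bla bla...", [10, 20, 60, 80, 0, 5, 30])

# Each class has its own probability, so a row does not have to sum to 1.
total = sum(targets)
print(dict(zip(CLASSES, targets)))
print("row sum:", round(total, 2))
```

So each document has seven separate numeric targets rather than one class label.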
E.g.:
text label
bla.. C1
lorem.. C2
ipsum C7
Then I applied the SVM weighter to these documents and kept only the terms whose weight was above a specific threshold (other feature selection methods, such as forward or backward selection, did not finish even after several hours).
Afterwards I trained an SVM model and ran a 10-fold cross-validation,
which performed pretty well, with an accuracy of 93%...
But in the end, I still have no solution to my initial problem and no clue how to proceed:
- Should I try to derive these probabilities from the confidence values of the SVM somehow? Is this possible? And how?
- Or train 7 linear regression models to predict these probabilities? But how do I find a proper feature selection with over 2000 terms?
- Or try a Bayesian model, which should give the probability of a class?
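To illustrate the second option, here is a rough sketch in plain Python (not RapidMiner) of one independent model per class. Note that it swaps plain linear regression for a sigmoid output trained on the soft targets with cross-entropy, so that predictions stay between 0 and 1. The toy data and the names `train_one_class` and `predict` are invented for illustration, and feature selection is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_one_class(X, y, epochs=5000, lr=0.5):
    """Fit weights and a bias for one class by batch gradient descent
    on cross-entropy, with soft targets y in [0, 1]."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of cross-entropy w.r.t. the logit is (prediction - target).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(models, x):
    """Return one probability per class; the values need not sum to 1."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for w, b in models]

# Invented toy data: 4 documents, 3 term-occurrence features, 2 classes.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
Y = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1], [0.7, 0.6]]

# One model per class, each trained on that class's column of probabilities.
models = [train_one_class(X, [row[k] for row in Y]) for k in range(len(Y[0]))]
preds = predict(models, [1, 0, 0])
print([round(p, 2) for p in preds])
```

With 7 classes this would mean 7 such models, each seeing the same term features but a different target column.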
Thank you in advance for your hints and suggestions!