Text classification problem with non-mutually-exclusive classes

mete New Altair Community Member
edited November 5 in Community Q&A
Hi everyone!

I have a set of documents, and for each one the probabilities that it belongs to each of seven classes.
E.g.:

    text          C1     C2     C3     C4     C5     C6     C7
    bla bla...    10%    20%    60%    80%    0%     5%     30%

I want to train a model which can predict these probabilities from a given text.

As you can see, the classes are not mutually exclusive: each document only has a probability of belonging to each class. Note also that these probabilities do not add up to 100%!


To get familiar with RapidMiner, I preprocessed the documents (tokenize, filter, ...) and gave them (mutually exclusive) labels.
E.g.:

    text       label
    bla..      C1
    lorem..    C2
    ipsum      C7

Then I weighted the attributes (terms) with the SVM weighter and kept only those above a specific threshold (other feature selection methods, like forward or backward selection, did not finish after several hours).
Afterwards I trained an SVM model and ran a 10-fold cross-validation,
which performed pretty well, with an accuracy of 93%...


But in the end I still have no solution to my initial problem and no clue how to proceed:
  • Should I try to get these probabilities out of the SVM's confidence values somehow? Is this possible? And how?
  • Or train 7 linear regression models to predict these probabilities? But how do I find a proper feature selection with over 2000 terms?
  • Or try a Bayesian model, which should give the probability of a class?

Thank you in advance for your hints and suggestions!

Answers

  • MartinLiebig
    Altair Employee
    Hello!

    First of all: did you use pruning in the Process Documents operator? That way you might get rid of some useless attributes. Furthermore, you should filter out stopwords etc.
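As a rough illustration of what pruning and stopword filtering do to the term list, here is a minimal Python sketch outside of RapidMiner. The stopword list, the thresholds `min_df` and `max_df_ratio`, and the sample documents are all made up for the example; they are not RapidMiner defaults.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Keep non-stopword terms whose document frequency lies between
    min_df (absolute count) and max_df_ratio (share of all documents)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)) - STOPWORDS)
    max_df = max_df_ratio * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_df)

# Hypothetical documents, only for demonstration.
docs = [
    "The engine of the car is loud",
    "A loud engine and a broken wheel",
    "The wheel of the car is new",
    "Engine noise in the car",
]
vocab = build_vocabulary(docs)
print(vocab)  # ['car', 'engine', 'loud', 'wheel']
```

Terms that occur in only one document ("broken", "new", "noise") are dropped, which is usually where most of the 2000+ attributes come from.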
    Should I try to get these probabilities out of the SVM's confidence values somehow? Is this possible? And how?
    If you use X-Prediction instead of X-Validation, you get an example set that includes the confidences. That might help.
    Or train 7 linear regression models to predict these probabilities? But how do I find a proper feature selection with over 2000 terms?
    I don't think a linear regression model works well on text data. You could, however, try to use the SVM in regression "mode": simply use a numerical label with the standard SVM of RapidMiner, and it does a regression instead of a classification.

    Proper feature selection is tricky. Of course a Forward Selection will not work on 2000 attributes: the first two rounds alone already require 2000 + 1999 model evaluations, each of which typically involves its own validation run. I like the idea of Weight by SVM.
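To get a feel for the cost, the number of model trainings can be counted with a few lines of Python. This is a sketch that assumes plain greedy forward selection, where each round tries every attribute not yet selected, and an optional k-fold validation per candidate; the function name and the round counts are illustrative, and RapidMiner's actual operator has additional stopping criteria.

```python
def forward_selection_evaluations(n_attributes, n_rounds, cv_folds=1):
    """Number of model trainings greedy forward selection needs for n_rounds rounds.

    Round r (0-based) tries each of the n_attributes - r attributes that are
    not yet selected, and each try costs cv_folds trainings.
    """
    rounds = min(n_rounds, n_attributes)
    candidates = sum(n_attributes - r for r in range(rounds))
    return candidates * cv_folds

first_two = forward_selection_evaluations(2000, 2)              # 2000 + 1999
with_cv = forward_selection_evaluations(2000, 2, cv_folds=10)   # 10 trainings per candidate
fifty = forward_selection_evaluations(2000, 50, cv_folds=10)    # selecting 50 attributes
print(first_two, with_cv, fifty)  # 3999 39990 987750
```

Under these assumptions, selecting just 50 attributes with 10-fold validation already means close to a million SVM trainings, which matches the observation that the search does not finish within hours.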

    Or try a Bayesian model, which should give the probability of a class?
    A Bayesian model might work. Additionally, you could try a k-NN with cosine similarity, but applying the model might take a while.
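For the k-NN idea, one possible reading is to predict the whole probability vector at once by averaging the vectors of the most similar training documents. The sketch below is plain Python, not a RapidMiner process; the three-class training data, the raw term counts (no TF-IDF) and the unweighted average are assumptions made for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k=2):
    """Average the class-probability vectors of the k most similar training documents."""
    query = vectorize(text)
    # Every prediction compares the query with all training documents,
    # which is why applying a k-NN model is slow on large collections.
    ranked = sorted(train, key=lambda item: cosine(query, vectorize(item[0])), reverse=True)
    neighbours = ranked[:k]
    n_classes = len(neighbours[0][1])
    return [sum(targets[c] for _, targets in neighbours) / len(neighbours)
            for c in range(n_classes)]

# Hypothetical training data: (text, [P(C1), P(C2), P(C3)]); rows need not sum to 1.
train = [
    ("engine car wheel", [0.8, 0.1, 0.6]),
    ("car wheel road", [0.6, 0.3, 0.4]),
    ("soup bread cheese", [0.0, 0.9, 0.1]),
]
prediction = knn_predict(train, "car wheel brake", k=2)
print(prediction)
```

Here the two car-related documents are the nearest neighbours, so the prediction is roughly 0.7, 0.2 and 0.5 for the three classes. Because each class is averaged independently, nothing forces the predicted values to add up to 100%, which fits the non-mutually-exclusive setup.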