[SOLVED] Build preliminary model with SVM for filtering

kasper2304
edited November 5 in Community Q&A
Hi out there.

The case is that I am building a model for text classification. I have around 500,000 forum posts that I want to classify according to whether they contain an idea or not. I have already classified 300 posts, of which 9 were positive cases. The idea is to use the already classified posts to build a "bad" model that can give me a better clue about which posts contain positive cases. Having done that, the idea is to use this preliminary bad model as a filter to pick out around 3000 posts that should be manually classified.

I have two questions I would like to ask.

1) How to pick out a good number of variables to work with? For example, principal components, a decision tree, or just taking the highly correlated variables?
2) How to set up the support vector machine in RapidMiner? (I expect nothing more than hints, and I have watched the video.)

I have the document-term matrix in both binary-weighted and tf-idf-weighted form. I simply want to try out both and see how they perform.

Any help is welcome

Best
Kasper

Answers

  • MariusHelf
    Hi Kasper, let's start with your second question, as that's the easy one. For text classification you probably want to use a linear SVM. In this case, the only parameter to be optimized is C. A good range is from 1E-6 to 1E3 on a logarithmic scale. You can do that easily with the Optimize Parameters (Grid) operator in combination with a Cross Validation. An SVM with an RBF kernel has an additional important parameter called gamma, which has to be optimized over the same range as C.
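
To make the search range concrete, here is a minimal Python sketch of the candidate values such a grid search would walk through. It only builds the grid; it is not RapidMiner's Optimize Parameters (Grid) operator, and the helper name `log_grid` and the choice of one step per power of ten are assumptions made for illustration.

```python
import itertools


def log_grid(low_exp: int, high_exp: int) -> list[float]:
    """Return the powers of ten from 10**low_exp to 10**high_exp, inclusive."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


# Linear SVM: C is the only parameter to tune.
c_values = log_grid(-6, 3)

# RBF kernel: gamma is searched over the same range, so every
# (C, gamma) pair becomes one candidate for cross validation.
rbf_candidates = list(itertools.product(c_values, c_values))

print(len(c_values), "values of C for the linear SVM")
print(len(rbf_candidates), "(C, gamma) pairs for the RBF kernel")
```

Each candidate would then be evaluated with a cross validation, and the setting with the best validated performance kept.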

    For the first question there is no clear answer. You have mainly two choices: using statistical approaches, or using experimental approaches. An example of the latter is Forward Selection, which has a subprocess that usually contains a cross validation for the model of your choice. During execution, the forward selection starts with an empty attribute set and iteratively adds the attribute from the original data that yields the highest performance gain. The algorithm stops when adding a new attribute no longer achieves a significant gain. This is usually much slower than the statistical approaches, but has the advantage that it catches attribute interactions and is inherently optimized for your specific model creation method.
    The statistical approaches, on the other hand, are often much faster, but often don't find attribute interactions. You already mentioned some of them, and the only hint I can give you is to try them out :)
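
The forward selection loop described above can be sketched in a few lines of Python. This is a simplified illustration, not RapidMiner's operator: the `score` callback stands in for the cross-validated performance of the chosen model, and `toy_score`, its attribute names, and the `min_gain` threshold are all hypothetical.

```python
from typing import Callable, Sequence


def forward_selection(
    attributes: Sequence[str],
    score: Callable[[list[str]], float],
    min_gain: float = 1e-3,
) -> list[str]:
    """Greedily add the attribute with the best score until the gain is too small."""
    selected: list[str] = []
    best_score = None
    remaining = list(attributes)
    while remaining:
        # Evaluate every one-attribute extension of the current selection.
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        top_attr = max(candidate_scores, key=candidate_scores.get)
        top_score = candidate_scores[top_attr]
        # Stop when the best extension brings no significant gain.
        if best_score is not None and top_score - best_score < min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected


def toy_score(subset: list[str]) -> float:
    """Hypothetical stand-in for a cross-validated performance estimate."""
    gains = {"idea": 0.30, "suggest": 0.20, "the": 0.0005}
    total = 0.5 + sum(gains[a] for a in subset)
    if "idea" in subset and "suggest" in subset:
        total += 0.05  # an interaction that only shows up when both are present
    return total


chosen = forward_selection(["the", "idea", "suggest"], toy_score)
print(chosen)
```

Because every candidate is scored together with the attributes already selected, interactions like the one in `toy_score` are picked up, at the cost of many more model evaluations than a purely statistical filter.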

    Happy Mining!
    ~Marius
  • kasper2304
    Thx Marius for the very nice reply.

    Until now the best model I have is a logistic regression with a "dot" kernel and X-validation. For dimensionality reduction I did PCA on 50 extracted terms and reduced the number of terms to 20, which I then used for my logistic regression. I recall my teacher saying something about traditional regression models being good for small datasets... I cannot really recall whether he actually said that, but I reasoned that there must be a reason why they are commonly used within scientific research and statistics...

    One interesting thing I discovered while playing around: I had 3 persons classifying my training set, and instead of only using the cases with "full agreement", I also used the cases where two of the human classifiers marked the post as positive. This gives me 12 positive cases, on which my logistic regression performs a lot better than with only "pure" positive cases, which makes good sense in my world. Before going with your approach, Marius, I have the following results:

    class recall(0) = 96.53%
    class recall(1) = 41.67%
    class precision(0) = 97.54%
    class precision(1) = 33.33%
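
As a cross-check, the four figures above can be reproduced from a single confusion matrix. The counts below are inferred, not stated in the post: they assume 300 classified posts of which 12 are positive, and they are simply the combination that matches all four reported percentages. The function name `class_metrics` is made up for this sketch.

```python
def class_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Per-class recall and precision from binary confusion-matrix counts."""
    return {
        "class recall(0)": tn / (tn + fp),
        "class recall(1)": tp / (tp + fn),
        "class precision(0)": tn / (tn + fn),
        "class precision(1)": tp / (tp + fp),
    }


# Inferred counts (assumption): 12 positives and 288 negatives in 300 posts.
metrics = class_metrics(tp=5, fp=10, fn=7, tn=278)
for name, value in metrics.items():
    print(f"{name} = {value:.2%}")
```

Read this way, the model finds 5 of the 12 positive posts and raises 10 false alarms, which shows how few cases the positive-class percentages rest on.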

    I will try the SVM approach you told me about. Most of the literature suggests SVM, so I am a bit puzzled that it did not perform that well when I first tried it out, but maybe with your settings it will.

    Again thx for your answer.

    My professor is busy at the moment so I cannot expect much help from him, so your help is really appreciated!
  • MariusHelf
    Hi,

    Common use is not a good indicator of quality. There are thousands of examples where the masses are led in a completely wrong direction, most often by some cool buzzwords. Think for example of neural nets, which are a horrible beast, hard to interpret and to optimize. Nevertheless, most newbies to data mining ask for neural nets, even though a very nice alternative in the form of the SVM exists.

    Concerning your bad performance: SVMs need to be optimized, as written in my post above. Btw, RapidMiner's logistic regression is powered internally by an SVM with a special kernel type.

    Happy Optimizing!
    ~Marius
  • kasper2304
    Yes. Neural nets can be horrible, especially in terms of overfitting.

    Nice info about the logistic regression. Being a relatively new data miner, I had never heard about it before. I think I am going to look into that. Any good sources? My best result at this point is the logistic regression with dot kernel and X-validation with standard settings. I think I am going to go with that and use the optimize node on it as well.

    The only concern I have left at this point is the risk of overfitting the small dataset I have. Once I have decided what model to use, the idea is to use it to extract 3000 new posts and get 2 persons to classify them manually. This will of course give a lot more positive cases, which is nice as positive cases are a bit rare within my problem. The model I use as the filter for the 3000 cases should not be TOO fitted to the 300-case dataset, but I don't really know how to assess that. Any ideas?
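
The filtering step itself could look roughly like the Python sketch below: score every unlabeled post with the preliminary model and keep the highest-scoring ones for manual classification. The post ids and scores are made-up placeholders, and treating the model output as a single confidence value per post is an assumption of this sketch.

```python
import heapq


def pick_for_labeling(scores: dict[str, float], k: int) -> list[str]:
    """Return the ids of the k posts with the highest model score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical confidence scores for the positive ("contains an idea") class.
scores = {"post_a": 0.91, "post_b": 0.12, "post_c": 0.47, "post_d": 0.88}

# With the real data, k would be around 3000.
shortlist = pick_for_labeling(scores, k=2)
print(shortlist)
```

`heapq.nlargest` avoids sorting all 500,000 scores when only a few thousand posts are needed.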

    Just ran the logistic regression with dot kernel, optimized with the settings you suggested, and it is the one that has performed best so far. I think I am going in the right direction here! :)