"[Solved] SVM parameter optimization of C and epsilon"

qwertz (New Altair Community Member)
edited November 5 in Community Q&A
Dear all,

I have never been in the fortunate situation of attending lectures on data mining, so I lack the mathematical details of how these methods actually work. Nevertheless, I'd like to have at least an understanding of what the parameters influence.

To start into this topic I read some papers and searched the forum. Promising results were:
- "A User's Guide to Support Vector Machines" by Ben-Hur and Weston
- http://www.svms.org/parameters/
- http://rapid-i.com/rapidforum/index.php/topic,5863.0.html


In my case I am using the "SVM (Linear)" operator, i.e. an SVM with a linear kernel. The post mentioned above says that for this type of SVM only the parameter C has to be optimized. However, there are still other parameters that can be set.

So I was wondering
- What effect do the parameters "C", "convergence epsilon" and "epsilon" have on the model at all?
- What are reasonable bounds for these parameters in a (logarithmic grid) optimization?


I'd appreciate it if someone could help with these questions or point me to an "easy to understand" introduction to SVMs.



Best regards
Sachs

Answers

  • MariusHelf (New Altair Community Member)
    Hi,

    depending on what you consider "easy to understand", I would recommend the book "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman. It is a very comprehensive introduction to data mining, and given the complexity of the matter it is well written and easy to understand. You can download it for free from the authors' website: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
    For serious work with it, though, I would recommend buying a paper copy.

    To answer your questions about the SVM at a very high level:
    Consider a classification problem with a fuzzy, ragged separation line between the two classes. When creating a classification model, you can go to two extremes:
    1. You create a model which classifies each and every data point in the training data correctly. That results in a very complex description of the data: you have to describe every bend and every outlier. To create such a model, the learning algorithm must have a "high capacity".
    2. You model the data with a single straight line. This results in a very simple model, but you will probably make a lot of mistakes on the training data. We say the algorithm has a "low capacity".

    The choice of the capacity depends on the data: if the training data is very noisy, for example, you do not want an algorithm with a high capacity, because it would model each noise point as part of the model and thus generalize badly to new data. Conversely, a method with a low capacity has disadvantages on problems which are inherently complex.

    Different learning algorithms have different capacities: a neural net, for example, has a high capacity, whereas a decision tree with only one or two levels has a rather low capacity.

    The cool thing about the SVM is that you can control its capacity with a single parameter, without changing the algorithm itself. That is the C that you can configure.
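
    If you want to see that effect in practice outside of RapidMiner, here is a minimal sketch in Python; scikit-learn's LinearSVC is only a stand-in for the "SVM (Linear)" operator, and the noisy toy data is made up purely for illustration:

        # Sketch: how C trades model simplicity against training accuracy.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.svm import LinearSVC

        # Noisy two-class toy data: 20% of the labels are flipped on purpose.
        X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                                   random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        for C in (0.001, 1.0, 1000.0):
            model = LinearSVC(C=C, max_iter=10000).fit(X_train, y_train)
            # Small C = low capacity: a simple model that tolerates training
            # errors. Large C = high capacity: it tries to fit every point.
            print(f"C={C:>7}: train={model.score(X_train, y_train):.3f}, "
                  f"test={model.score(X_test, y_test):.3f}")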



    Internally, the SVM does not calculate the model with a simple closed formula, but optimizes it stepwise. When the model has converged according to a certain measure, the algorithm stops and considers the current model "good enough". The strictness of this optimization is controlled by the epsilon parameters. Usually you can use the default settings.

    For C, as written elsewhere, good choices are values in the range from 10^-6 to 10^3. If you see performance constantly increasing towards one of the borders, you should of course extend the search space. The same applies to the kernel_gamma parameter, which has to be optimized when using an SVM with an RBF (radial) kernel.
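
    A logarithmic grid over that range could look like the following sketch, with scikit-learn's GridSearchCV standing in for RapidMiner's grid optimization operator (the data set is again made up for illustration):

        # Sketch: log-scale grid search for C from 10^-6 to 10^3.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import GridSearchCV
        from sklearn.svm import LinearSVC

        X, y = make_classification(n_samples=500, n_features=10, random_state=0)

        param_grid = {"C": np.logspace(-6, 3, num=10)}  # 10^-6, 10^-5, ..., 10^3
        search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=5)
        search.fit(X, y)
        print("best C:", search.best_params_["C"])
        # If the best C lands on a border of the grid, extend the grid in
        # that direction and search again.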

    Best regards,
    Marius
  • qwertz (New Altair Community Member)

    Hi Marius,

    Thank you very much for taking the time to give such a helpful and detailed explanation!
    Your support in this forum is as outstanding as the RapidMiner software itself!

    May I ask one last question on SVMs? The "SVM (Linear)" operator has two parameters, "convergence epsilon" and just "epsilon". What is the difference between them?


    Best regards
    Sachs
  • MariusHelf (New Altair Community Member)
    As described above, the SVM internally optimizes a hyperplane. In each iteration, the cost (the so-called loss) is calculated, i.e. wrongly classified examples are "counted" in a certain way. In the next iteration the hyperplane is adjusted according to the errors discovered in the previous one. The algorithm stops when the hyperplane is "good enough" according to certain constraints, the so-called KKT constraints. "Good enough" here is influenced by the convergence_epsilon.
    The other epsilon parameter specifies a delta around the true value of each example within which a prediction is still considered "correct"; this is mainly relevant for regression.
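
    In code, that delta corresponds to the standard epsilon-insensitive loss of SVM regression; here is a tiny sketch (the epsilon value of 0.1 is just an example):

        # Sketch: epsilon-insensitive loss. Predictions within epsilon of the
        # true value incur no loss; outside the tube, the loss grows linearly.
        def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
            return max(0.0, abs(y_true - y_pred) - epsilon)

        print(epsilon_insensitive_loss(2.0, 2.05))  # 0.0 -> still "correct"
        print(epsilon_insensitive_loss(2.0, 2.50))  # 0.4 -> penalized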

    But as I said, usually you do not need to touch any of these parameters.

    Best regards,
    Marius
  • qwertz (New Altair Community Member)

    You are marvelous! Thanks again!

    Best regards
    Sachs