LibSVMLearner one-class classification

Evoll
Sorry if this has been asked before; I did a search but could not find any relevant information. I was wondering whether anyone has successfully run the LibSVM one-class SVM for outlier detection. Every time I try it, it throws exceptions, mostly null pointer exceptions. I have tried adjusting all the settings, which sometimes results in other errors, like index out of bounds. I tried it on my own data, which is all numeric except for a binomial label, and I have also tried it on this example data I came across: https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011498.html

If anyone can provide an example that works (data and settings), that would be very helpful in figuring out what I am doing wrong.

Thanks in advance

John R

Answers

  • IngoRM
    Hello John,

We had never worked with the one-class LibSVM before (I always asked myself why I should not simply use one of the density-based outlier detection operators instead...). So today was the first time we tried it and....

...you are right. It did not work properly. So we worked a bit on better support for the one-class SVM; you can get the fix via CVS (please refer to http://rapid-i.com/content/view/25/48/ for a description of how to access the latest developer version via CVS) or simply wait for the next release.

With the update, you can for example run a process like this one (the first operator simply creates a data set):

    <operator name="Root" class="Process" expanded="yes">
        <!-- generate an artificial 2D data set (two Gaussian mixture clusters) -->
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_examples" value="200"/>
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="target_function" value="gaussian mixture clusters"/>
        </operator>
        <!-- remove the generated label: one-class learning is unsupervised -->
        <operator name="FeatureNameFilter" class="FeatureNameFilter">
            <parameter key="filter_special_features" value="true"/>
            <parameter key="skip_features_with_name" value="label"/>
        </operator>
        <!-- create a constant attribute "class" with value 0 -->
        <operator name="FeatureGeneration" class="FeatureGeneration">
            <list key="functions">
              <parameter key="class" value="const[0]()"/>
            </list>
            <parameter key="keep_all" value="true"/>
        </operator>
        <!-- convert the numeric "class" attribute into a nominal one -->
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="attribute_name_regex" value="class"/>
            <operator name="Numeric2Polynominal" class="Numeric2Polynominal">
            </operator>
        </operator>
        <!-- declare "class" as the (single-class) label -->
        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="class"/>
            <parameter key="target_role" value="label"/>
        </operator>
        <!-- train the one-class SVM -->
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="C" value="100.0"/>
            <list key="class_weights">
            </list>
            <parameter key="gamma" value="1.0"/>
            <parameter key="keep_example_set" value="true"/>
            <parameter key="svm_type" value="one-class"/>
        </operator>
        <!-- apply the model: the resulting confidences indicate outlier-ness -->
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
    </operator>
    The confidence attribute will then show the degree of outlier-ness of each data point. Of course, you can also apply the model to completely unseen data. The result could look like the attached plot.
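
    For readers who want to poke at the same behavior outside of RapidMiner: libsvm's one-class mode is also exposed through scikit-learn. The following is only a rough sketch under that assumption (OneClassSVM with an RBF kernel on synthetic clusters, standing in for the generator above), not part of the process XML:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # two Gaussian clusters, similar in spirit to the generated data above
    rng = np.random.default_rng(42)
    X = np.vstack([
        rng.normal(loc=-2.0, scale=0.5, size=(100, 2)),
        rng.normal(loc=2.0, scale=0.5, size=(100, 2)),
    ])

    # one-class SVM with an RBF kernel; nu bounds the fraction of outliers
    ocsvm = OneClassSVM(kernel="rbf", gamma=1.0, nu=0.05).fit(X)

    # decision_function plays the role of the confidence attribute:
    # the lower the score, the more outlier-like the point
    scores = ocsvm.decision_function(X)
    print("most outlier-like points:", np.argsort(scores)[:5])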

    Cheers,
    Ingo

    [attachment deleted by admin]
  • haddock
    Hi Ingo,

    I'm also using LibSVM, but for regression. I get nice results in a sliding window validation, so I save/overwrite the model in each pass. But when I load some other data and the model, and try to apply the model to the data, something weird happens... the prediction is always the same as the last time it was applied during the validation! So I get the same prediction for each example in the new data.  :-\

    Do you think these problems are related?

  • IngoRM
    Hi,

    Actually, I do not think that these problems are related. The problem with the one-class SVM was that it was actually not supported at all  ;D

    Your problem sounds to me like a combination of two other problems (not related to RapidMiner in particular but to data mining in general):

    1. the training sets might be too small, together with
    2. C might be chosen too high, so that the model is overfitted and just repeats the prediction.

    Alternatively, the model has not learnt anything at all and just returns the default prediction (this would indicate that C is too low or that the kernel is not appropriate). Since you overwrite the model each time in the validation, the prediction of the last default model would then be repeated independently of the actual values. Thinking it over again, I find this the more likely explanation. Possible solutions include optimizing the parameters, changing the learner, or using more training examples, among several others.
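
    As an aside, this "constant prediction" effect is easy to reproduce with libsvm's epsilon-SVR outside of RapidMiner. A minimal sketch (scikit-learn wrapper, synthetic data; just an illustration of the effect, not your process): with C chosen far too low, the dual coefficients are capped near zero and the prediction collapses to a near-constant value.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(200, 1))
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

    for C in (1e-6, 1.0):
        svr = SVR(kernel="rbf", C=C, gamma=1.0).fit(X, y)
        preds = svr.predict(X)
        # with C=1e-6 the spread is (almost) zero: every example gets
        # essentially the same prediction, just like a default model
        print(f"C={C:g}: prediction spread = {preds.max() - preds.min():.4f}")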

    Cheers,
    Ingo

  • haddock
    Thanks Ingo!

    You're indeed a clever fellow - I had optimised the parameters before and was a bit suspicious of the results, and I had been unable to replicate the problem using generators for the example set. Ah well  :(.

  • Evoll
    Thank You

    I was able to download the HEAD from CVS and compile it in Eclipse. I just replaced the data set generator with my own example set and some filters. I was able to run the Iris test set without problems, but I have no test set to check the real accuracy. On my own data it didn't do any better than LOF, distance-based, or density-based detection. There was one odd quirk, though: it wouldn't produce any results, just question marks, if two specific attributes were both selected. If one or the other was selected on its own, it produced output; since the two attributes were meant to be analyzed independently of each other anyway, it was not a big deal. It just took me a while to figure out what was going on.

    My school project deals with finding outliers, so it was good to have one more method for comparison. I also noticed there is now more information for each operator, which is nice, since I was able to reference some papers for the algorithms. I was curious whether either of the following papers is the one behind the density-based outlier detection, as the notes seem to imply.

    Knorr, E. M., and Ng, R. T. Algorithms for mining distance-based outliers in large datasets. In VLDB '98: Proceedings of the 24th International Conference on Very Large Data Bases (San Francisco, CA, USA, 1998), Morgan Kaufmann Publishers Inc., pp. 392-403.

    Knorr, E. M., Ng, R. T., and Tucakov, V. Distance-based outliers: algorithms and applications. The VLDB Journal 8, 3-4 (2000), 237-253.
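
    Out of curiosity, I also sketched the first paper's DB(p, D) notion directly; a naive O(n^2) version in Python (my own toy helper, not one of the RapidMiner operators):

    import numpy as np

    def db_outliers(X, p=0.95, D=1.0):
        # Knorr-Ng DB(p, D) definition: a point is an outlier if at least
        # a fraction p of the remaining points lie farther than D from it
        n = len(X)
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        within = (dists <= D).sum(axis=1) - 1  # neighbours within D, self excluded
        return within <= (1.0 - p) * (n - 1)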

    Thank You Again

    John R
  • IngoRM
    Hi John,

    That's exactly my point when I said "I always asked myself why I should not simply use one of the density-based outlier detection operators instead...". In my opinion, learning an SVM model hardly improves the quality of outlier detection, but anyway... At least it is great that you have one additional scheme for comparison  ;)

    The density-based operator definitely relies on one of the two cited papers, but I don't remember which one exactly, sorry.

    Cheers,
    Ingo