XV test on subset of training data

noah977 · February 2009

I have an unusual need for XValidation. I want to train on a full set of data, but only test against a subset of it. I'll explain. We are looking at the runners in a race. We are less concerned with the accuracy of the model across all the runners and more interested in its accuracy for the first three. (I don't care if the model predicts the last two runners correctly. I care how well it predicts winners.)

So, my model would ideally look like this.


    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="number_of_validations" value="4"/>
        <parameter key="sampling_type" value="shuffled sampling"/>
        <!-- Training side: learn on the full training fold -->
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <parameter key="C" value="0"/>
            <parameter key="gamma" value="0"/>
            <parameter key="svm_type" value="nu-SVR"/>
        </operator>
        <!-- Testing side: keep only examples with rank < 4, then apply the model and evaluate -->
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ExampleFilter" class="ExampleFilter">
                <parameter key="parameter_string" value="rank&lt;4"/>
            </operator>
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
                <parameter key="keep_model" value="true"/>
            </operator>
            <operator name="Performance" class="Performance">
                <parameter key="keep_example_set" value="true"/>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                    <parameter key="XV" value="operator.XValidation.value.iteration"/>
                    <parameter key="Time" value="operator.XValidation.value.looptime"/>
                    <parameter key="perf" value="operator.Performance.value.performance"/>
                </list>
            </operator>
        </operator>
    </operator>

THIS DOES NOT WORK.
What happens is that the example set is reduced in the first iteration. After that, it is training with the reduced set. What I need to do is RESET the example set back to the original for each round of SVM learning.
Do you have any suggestions?

steffen · February 2009

Hello Noah

The problem is RapidMiner's internal data structure, which (normally) consists of the data stored once plus an arbitrary number of views on that data. A lot of operators create views, not copies, which is one of the reasons RapidMiner evaluates so fast. However, some operators do not create views and so alter the referenced data (in the standard XValidation a view is created for training and for test, hence your processing alters the data itself). I am a little confused that ExampleFilter alters the underlying data (makes no sense...), but it is as it is, so...

The only thing I can suggest at this point is:
-> Use MaterializeDataInMemory before applying the filter. This can cause problems if you have any nominal attributes.
-> Change MaterializeDataInMemory so that the problem mentioned above cannot occur. I suggested such a change in the RapidMiner bug tracker. Copy the operator, change the mentioned lines, and put it in a self-created plugin.
-> Rewrite XValidation so that only full copies are passed to train and test. A lot more work, but (in my opinion) worth the time.
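
As a rough sketch of the first suggestion, the testing chain from the process above might look like the fragment below. This is untested. It assumes the operator's class name is the same as its name (MaterializeDataInMemory), as with the other operators in this thread, and that its default parameters are acceptable. The ProcessLog operator is left out for brevity.

```xml
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
    <!-- Assumption: materializing here gives the filter its own copy of the
         test fold, so the data shared with later iterations stays untouched -->
    <operator name="MaterializeDataInMemory" class="MaterializeDataInMemory"/>
    <operator name="ExampleFilter" class="ExampleFilter">
        <parameter key="parameter_string" value="rank&lt;4"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
        <parameter key="keep_model" value="true"/>
    </operator>
    <operator name="Performance" class="Performance">
        <parameter key="keep_example_set" value="true"/>
    </operator>
</operator>
```

The caveat about nominal attributes still applies to this variant.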

No quick solution, sorry.

regards,

Steffen

PS: I hope that in the next major release of RM there will be a clearer structure so that views / data alterations can be recognized from the GUI-POV.

land · February 2009

Hi Steffen, Hi Noah,
I'm sorry, but I'm unable to reproduce your problem. If I use your process and generate a few examples before applying the XValidation, it works just fine. The ExampleFilter filters correctly and, of course, creates a view, so the performance is estimated on the filtered examples only. In the next iteration the correct subsample of the original example set is used for learning and assessing again.

Please take a look at your setup to see whether something else is going wrong.

Greetings,
Sebastian
