"pre-processing inside xvalidation loop"
gromgull
New Altair Community Member
Hi,
A student of mine has been developing some pre-processing operators based on the Value Distance Metric, FastMap, etc.
These operators essentially compute some parameters for transformation of data based on the correlation between attributes and the label.
Now we would like to use xvalidation to evaluate our approach, but this introduces a problem: because the parameters for the transformation are based on the complete data, knowledge about the test data has already crept into the training phase. I have a feeling this is a sort of cheating, i.e. we are no longer measuring performance on unseen data.
This is also a (less severe) problem for PCA: the eigenvectors are computed based on the whole example set, which is only then split into training/testing. However, for PCA the label is not used, and the only "leaking" of information is ensuring what is usually just an assumption, i.e. that training and testing data are drawn from the same probability distribution.
The solution seems to be to apply the pre-processing operator INSIDE the xvalidation loop, i.e. estimate the transformation parameters based on training data only, and then apply the same transformation to both training and testing data.
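To make the idea concrete, here is a minimal sketch of the same principle in Python with scikit-learn (not RapidMiner; the iris data and GaussianNB are just stand-ins for illustration). The "leaky" variant fits PCA on all examples before validation starts, while the "clean" variant re-fits PCA on each training fold only and merely applies it to the held-out fold:

# Minimal sketch (scikit-learn, not RapidMiner) of keeping the
# pre-processing inside the cross-validation loop.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Leaky: PCA fitted on ALL examples, so the test folds have already
# influenced the transformation parameters before validation starts.
X_leaky = PCA(n_components=2).fit_transform(X)
leaky_scores = cross_val_score(GaussianNB(), X_leaky, y, cv=2)

# Clean: PCA is part of the pipeline, so it is re-fitted on the
# training fold only and merely applied to the held-out fold.
pipe = make_pipeline(PCA(n_components=2), GaussianNB())
clean_scores = cross_val_score(pipe, X, y, cv=2)

print("leaky:", leaky_scores.mean())
print("clean:", clean_scores.mean())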
Did anyone else think of this problem? Work around it?
I tried to make this work for PCA in the process below, but I face a few technical problems (this is with v4.5; I assume 5 would have similar issues):
1. When I apply the PCA model in the first part of the xval operator, it seems to remove the attributes from the test data as well. This happens even if I check the "create_view" checkbox.
2. When I come to the second xval sub-operator, I have two models: one for the PCA and one from the learner. How do I choose which to apply?
Thanks for any feedback!
<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="/home/grimnes/projects/organik/workspace/RDFRapidMiner/zxc"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="number_of_validations" value="2"/>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="PCA" class="PCA">
</operator>
<operator name="ApplyPCA" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="true"/>
</operator>
<operator name="NaiveBayes" class="NaiveBayes">
</operator>
</operator>
<operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
<operator name="Apply which model?" class="ModelApplier" breakpoints="before">
<list key="application_parameters">
</list>
<parameter key="create_view" value="true"/>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<parameter key="accuracy" value="true"/>
<list key="class_weights">
</list>
</operator>
</operator>
</operator>
</operator>
Answers
Hi,
of course other people have stumbled over this problem before. This is one of the major mistakes when estimating performance...
To work around this issue, we developed preprocessing models, so that you can apply the preprocessing to the test data just like the final model. This has been greatly extended in Version 5.0, so I would suggest moving to 5.0 anyway. In principle this is doable in RapidMiner 4.x, too: you simply build the preprocessing model during the training phase and pass it on to the test phase, then apply the preprocessing model first and the classification model afterwards.
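As a rough sketch of that ordering in the test phase, assuming a Python/scikit-learn analogue rather than actual RapidMiner operators (PCA, GaussianNB, and the iris data here are just placeholders for your own preprocessing and classification models):

# Sketch of "preprocessing model first, then classification model"
# on the test fold (Python stand-in, not the RapidMiner operators).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Training phase: fit the preprocessing "model" and the learner
# on the training fold only.
pca = PCA(n_components=2).fit(X_train)
clf = GaussianNB().fit(pca.transform(X_train), y_train)

# Test phase: apply the preprocessing model first, then the
# classification model, in exactly that order.
X_test_transformed = pca.transform(X_test)
print(accuracy_score(y_test, clf.predict(X_test_transformed)))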
I really would suggest moving to 5.0, since the data flow is more explicit there, which avoids any confusion between the models...
Greetings,
Sebastian
Hi,
In continuation of the same issue, another question regarding the "XValidation":
I am facing some strange behavior, or at least I don't understand it. My ExampleTable contains 195 examples, and when I use the "XValidation" (number_of_validations = 2; local_random_seed = 2) the following validation cycles occur.
Cycle 1: Train = 97, Test = 98 + model applied + performance evaluated
Cycle 2: Train = 98, Test = 97 + model applied + performance evaluated
Here comes the weird part, where I am confused:
Cycle 3: again inside XValidation, Train = 195 (the pre-processing step is applied), but then it didn't seem to learn the model as before; it just took the already learned model and finished the process (I mean no test evaluation + no model applied + no performance evaluation).
Can you please give some feedback on what the possible reason could be, or whether I have done something wrong?
The process XML is given below:
<operator name="Root" class="Process" expanded="yes">
<operator name="CSVExampleSource" class="CSVExampleSource">
<parameter key="filename" value="C:\examplesetdata.csv"/>
<parameter key="use_comment_characters" value="false"/>
<parameter key="skip_error_lines" value="true"/>
</operator>
<operator name="ChangeAttributeRole" class="ChangeAttributeRole">
<parameter key="name" value="class"/>
<parameter key="target_role" value="label"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="keep_example_set" value="true"/>
<parameter key="create_complete_model" value="true"/>
<parameter key="number_of_validations" value="2"/>
<parameter key="local_random_seed" value="2"/>
<operator name="Apply Preprocessing" class="OperatorChain" expanded="yes">
<operator name="FastMap" class="FastMap">
<parameter key="attributes" value=".*"/>
<parameter key="attribute_distance_metric" value="genres"/>
</operator>
<operator name="NaiveBayes" class="NaiveBayes">
<parameter key="keep_example_set" value="true"/>
</operator>
</operator>
<operator name="Performance Measure" class="OperatorChain" expanded="yes">
<operator name="FastMap (2)" class="FastMap">
<parameter key="attributes" value=".*"/>
<parameter key="attribute_distance_metric" value="genres"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
<parameter key="create_view" value="true"/>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<parameter key="keep_example_set" value="true"/>
<parameter key="main_criterion" value="accuracy"/>
<parameter key="accuracy" value="true"/>
<parameter key="weighted_mean_recall" value="true"/>
<parameter key="weighted_mean_precision" value="true"/>
<list key="class_weights">
</list>
</operator>
</operator>
</operator>
</operator>0 -
Hi,
the model is trained a third time because you checked "create_complete_model". The learning part is then executed once more after the usual validation folds, this time on the complete data set, in order to train the best model possible.
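In other words, it is the usual pattern of estimating performance with the folds and then retraining the final model on all data. Roughly, as a hypothetical Python/scikit-learn analogue (iris, PCA and GaussianNB are stand-ins, not your actual FastMap setup):

# Sketch of what create_complete_model amounts to: the folds give the
# performance estimate, a final fit on all examples gives the model
# you would actually deploy (Python analogue, not RapidMiner).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(PCA(n_components=2), GaussianNB())

# The two validation cycles: performance estimation only.
scores = cross_val_score(pipe, X, y, cv=2)
print("estimated accuracy:", scores.mean())

# The "third cycle": one extra training run on the complete data,
# with no test phase, just to produce the final model.
final_model = pipe.fit(X, y)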
By the way: I would recommend updating to RapidMiner 5.0. It is not only much more comfortable, but community support is only given for that version.
Greetings,
Sebastian