Hi,
A student of mine has been developing some pre-processing operators based on the Value Distance Metric, FastMap, etc.
These operators essentially compute some parameters for transformation of data based on the correlation between attributes and the label.
Now we would like to use cross-validation (XValidation) to evaluate our approach, but this introduces a problem: because the parameters for the transformation are computed from the complete data, knowledge about the test data has already crept into the training phase. I have a feeling this is a form of cheating, i.e. we are no longer measuring performance on unseen data.
This is also a (less severe) problem for PCA, i.e. the eigenvectors are computed from the whole example-set, which is only then split into training/testing. However, for PCA the label is not used, so the only "leaking" of information is that it enforces what is usually just an assumption, i.e. that training and testing data are drawn from the same probability distribution.
The solution seems to be to apply the pre-processing operator INSIDE the xvalidation loop, i.e. estimate the transformation parameters based on training data only, and then apply the same transformation to testing and training data.
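To make the idea concrete, here is a minimal sketch in plain Python (not RapidMiner). It is only an illustration under assumptions: a simple per-attribute standardization stands in for PCA or our label-based operators, a nearest-centroid classifier stands in for the learner, and all function names are made up for this example. The point is just that the transformation parameters are estimated inside each fold, from the training part only.

```python
import statistics

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of test indices."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_transform(rows):
    """Estimate transformation parameters (per-attribute mean and std).
    Stand-in for fitting PCA or a label-based pre-processing operator."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return means, stds

def apply_transform(params, rows):
    """Apply previously estimated parameters to any set of rows."""
    means, stds = params
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def fit_learner(rows, labels):
    """Nearest-centroid classifier (illustrative): one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Return the label of the closest class centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_validate(rows, labels, k=2):
    """Return one accuracy per fold; pre-processing is fitted per fold."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in test_set]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        test_rows = [rows[i] for i in test_idx]
        test_labels = [labels[i] for i in test_idx]

        # Parameters come from the training fold only ...
        params = fit_transform(train_rows)
        model = fit_learner(apply_transform(params, train_rows), train_labels)
        # ... and the same parameters are then applied to the test fold.
        predictions = [predict(model, r) for r in apply_transform(params, test_rows)]
        correct = sum(p == t for p, t in zip(predictions, test_labels))
        accuracies.append(correct / len(test_rows))
    return accuracies

# Tiny made-up data set, purely for illustration.
rows = [[1.0, 10.0], [5.0, 50.0], [1.2, 11.0], [5.1, 52.0]]
labels = ["a", "b", "a", "b"]
fold_accuracies = cross_validate(rows, labels, k=2)
```

The "cheating" variant would call `fit_transform(rows)` once on the complete data before the loop; here the test rows never influence `params`.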
Has anyone else run into this problem, or found a way to work around it?
I tried to make this work for PCA in the process below, but I face a few technical problems (this is with v4.5; I assume v5 would have similar issues):
1. When I apply the PCA model in the first part of the xval operator, it seems to remove the attributes from the test data as well. This happens even if I check the "create_view" checkbox.
2. When I come to the second xval sub-operator, I have two models: one from the PCA and one from the learner. How do I choose which to apply?
Thanks for any feedback!
<operator name="Root" class="Process" expanded="yes">
  <operator name="ExampleSource" class="ExampleSource">
    <parameter key="attributes" value="/home/grimnes/projects/organik/workspace/RDFRapidMiner/zxc"/>
  </operator>
  <operator name="XValidation" class="XValidation" expanded="yes">
    <parameter key="number_of_validations" value="2"/>
    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
      <operator name="PCA" class="PCA">
      </operator>
      <operator name="ApplyPCA" class="ModelApplier">
        <parameter key="keep_model" value="true"/>
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
      </operator>
      <operator name="NaiveBayes" class="NaiveBayes">
      </operator>
    </operator>
    <operator name="OperatorChain (2)" class="OperatorChain" expanded="yes">
      <operator name="Apply which model?" class="ModelApplier" breakpoints="before">
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
      </operator>
      <operator name="ClassificationPerformance" class="ClassificationPerformance">
        <parameter key="accuracy" value="true"/>
        <list key="class_weights">
        </list>
      </operator>
    </operator>
  </operator>
</operator>