Interpretation of X-Validation
christian1983
New Altair Community Member
Hi everybody,
I'm currently evaluating the quality of a classification model (a neural net) by applying standard 10-fold cross-validation.
I know how the performance vector, as the measure of quality, is calculated after the ten rounds of learning and testing (by averaging the 10 error estimates), but how are the final weights of each node determined?
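For reference, the averaging step looks roughly like this outside RapidMiner (a minimal sketch, assuming scikit-learn and its bundled Iris data rather than the RapidMiner operators):
# Minimal sketch (not the RapidMiner internals): 10-fold cross-validation
# trains and tests ten times and averages the per-fold accuracies.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
clf = MLPClassifier(max_iter=2000, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(clf, X, y, cv=cv)     # one accuracy per fold
print("fold accuracies:", fold_scores)
print("averaged estimate:", fold_scores.mean())     # the reported performance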
Here is the process being applied on the Iris data table:
I hope someone can help.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="395" width="620">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="104">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="multiply" expanded="true" height="76" name="Multiply" width="90" x="175" y="134"/>
<operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="165">
<process expanded="true" height="391" width="294">
<operator activated="true" class="neural_net" expanded="true" height="76" name="Neural Net" width="90" x="94" y="37">
<list key="hidden_layers"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="404" width="346">
<operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="24" y="33">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="112" y="120"/>
<operator activated="true" class="log" expanded="true" height="76" name="Log" width="90" x="112" y="255">
<parameter key="filename" value="C:\Dokumente und Einstellungen\ich\Desktop\Test.log"/>
<list key="log">
<parameter key="Performance" value="operator.Performance.value.performance"/>
<parameter key="Round" value="operator.Apply Model.value.applycount"/>
<parameter key="Average Performance" value="operator.Validation.value.performance"/>
</list>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_op="Log" to_port="through 1"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Thank you.
Answers
Hi there,
As I see it, your example will deliver the model created in the last pass through the NN learner. The help tab says I should be able to build a model on the whole data set - in which case that is where the weights would be established - but I cannot see that option in the parameters tab.
Hi,
if you retrieve the model from the output port of the XValidation, then a model is trained on the complete data set. You will notice this in the status bar: after learning and applying the model n times, it is learned an (n+1)-th time.
This behavior no longer depends on a parameter: the additional model is built whenever the output port is connected (and hence the model will be used later on).
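In other words (a minimal sketch, assuming scikit-learn as a stand-in, not the actual RapidMiner code): the n fold models only produce the estimate, and one extra fit on the complete data produces the delivered model.
# n fits yield the performance estimate, one final fit on all data yields the model.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
clf = MLPClassifier(max_iter=2000, random_state=0)

estimate = cross_val_score(clf, X, y, cv=10).mean()  # the n "inner" fits
delivered_model = clf.fit(X, y)                      # the (n+1)-th fit, trained on everything
print("estimated accuracy:", estimate)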
Greetings,
Sebastian
As usual Seb, spot on - I was too lazy to check the code, and I stand corrected. However...
Given that the performance vectors are generated over data subsets within the validation, and given that the delivered model is generated on the entire dataset, there is every chance that the delivered model will perform better than the average of the performances within the validation. It actually happens with the code posted above, and if I add some noise it becomes more obvious. It is only a matter of time before a wannabe bug hunter chews this over.
But there is a real point in the title of this thread, namely how we should interpret the results of validation, what we think we get out of it, and so on. So a quick flip to Wikipedia for a consensus view...
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
<process expanded="true" height="395" width="620">
<operator activated="true" class="retrieve" compatibility="5.0.0" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="add_noise" compatibility="5.0.8" expanded="true" height="94" name="Add Noise" width="90" x="45" y="165">
<parameter key="default_attribute_noise" value="0.05"/>
<list key="noise"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.0" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="Validation" width="90" x="380" y="165">
<process expanded="true" height="391" width="294">
<operator activated="true" class="neural_net" compatibility="5.0.0" expanded="true" height="76" name="Neural Net" width="90" x="94" y="37">
<list key="hidden_layers"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="404" width="346">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="24" y="33">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Performance" width="90" x="112" y="120"/>
<operator activated="true" class="log" compatibility="5.0.0" expanded="true" height="76" name="Log" width="90" x="112" y="255">
<parameter key="filename" value="C:\Dokumente und Einstellungen\ich\Desktop\Test.log"/>
<list key="log">
<parameter key="Performance" value="operator.Performance.value.performance"/>
<parameter key="Round" value="operator.Apply Model.value.applycount"/>
<parameter key="Average Performance" value="operator.Validation.value.performance"/>
</list>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Apply Model" from_port="model" to_op="Log" to_port="through 1"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="75">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.0.8" expanded="true" height="76" name="Performance (2)" width="90" x="692" y="49">
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Add Noise" to_port="example set input"/>
<connect from_op="Add Noise" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Validation" to_port="training"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Apply Model (2)" from_port="model" to_port="result 3"/>
<connect from_op="Performance (2)" from_port="performance" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
and a bit I find relevant is this:
"The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model."
I can see that the performance reported fits the bill - seen against unseen, etc. - but what about the model? Surely it would be better calculated in the same way, as some sort of average perhaps, or the optimum, or... Either way, the data scope of the performance and the model should be matched by default, or am I missing quite a lot (on balance the much more entertaining and likely possibility ;D )?
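For illustration only, here is the same comparison sketched outside RapidMiner (assuming scikit-learn; the noise level is arbitrary): the model fit on all data and scored on that same data will typically beat the cross-validated estimate, and the gap widens with noise.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
X_noisy = X + rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)  # a little attribute noise

clf = MLPClassifier(max_iter=2000, random_state=0)
cv_estimate = cross_val_score(clf, X_noisy, y, cv=10).mean()   # seen-against-unseen estimate
resubstitution = clf.fit(X_noisy, y).score(X_noisy, y)         # all-data model scored on seen data

print("cross-validated estimate:", cv_estimate)
print("delivered model on the full data:", resubstitution)     # usually the higher number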
Hi,
"It is only a matter of time before a wannabe bug hunter chews this over."
And this has actually already been chewed over several times - even in our old forum at SourceForge...
I completely agree on this:
"The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model."
Hence, the main goal is the estimation of the performance and not the creation of the model. Let's just assume RapidMiner did not provide an output port for the complete model. What would we analysts do then? Is there a natural model which we prefer over the others?
As you have said, we have several options. I am just starting a discussion about them (a rough sketch of the choices follows the list):
- selecting the worst one: no idea why I should do this - this model is very likely to underperform, and the performance is likely to be overestimated.
- selecting the best one: very risky! What if all models are not really good but at least not predicting randomly (let's assume 55% for a stratified binominal classification), and one model is predicting at random and achieves 56% just by chance? Is this one really more suitable than the others? Additionally, the performance could be underestimated (which is at least probably better than in scenario 1).
- selecting an average model: OK, but how do we do this for all model classes and types? And how do we ensure that we don't introduce a bias by choosing the aggregation function?
- selecting a model randomly from one of the folds: seems weird, but this would be my preference right after using the model built on the complete data set, since I would expect that, on average, this results in the model whose performance is closest to the estimated one if you repeat it often enough.
- learning the model on the complete data set (the RapidMiner way): using as much information as possible to increase the likelihood of obtaining an optimal model. The performance is more likely to be under- than overestimated (which is better in most application areas) and - more importantly - in the limit the estimated performance and the performance of the model become the same (consider leave-one-out, where the difference between the training data sets used is minimized).
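A rough sketch of those options (assuming scikit-learn; the variable names are only for illustration): keep each fold model together with its held-out score, then pick worst, best, random, or refit on everything.
import random
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
proto = MLPClassifier(max_iter=2000, random_state=0)

fold_models, fold_scores = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    m = clone(proto).fit(X[train_idx], y[train_idx])
    fold_models.append(m)
    fold_scores.append(m.score(X[test_idx], y[test_idx]))

worst = fold_models[fold_scores.index(min(fold_scores))]   # option 1
best = fold_models[fold_scores.index(max(fold_scores))]    # option 2: risky
# option 3 (an "average" model) has no general definition across model types
random_pick = random.choice(fold_models)                   # option 4
final = clone(proto).fit(X, y)                             # option 5: the RapidMiner way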
Cheers,
Ingo
Hola,
Point taken, but should it not be made clear that the performance delivered is not that of the model delivered? Folks could easily get confused...
Hi,
"Point taken, but should it not be made clear that the performance delivered is not that of the model delivered? Folks could easily get confused..."
OK, we could do that. But I would assume that then even more people would be confused: isn't the whole purpose of cross-validation to deliver exactly this - the performance of the model which will later be used for scoring? Of course it does not deliver exactly that performance: that is why the whole scheme is called performance estimation and not performance calculation...
To conclude, I would suggest adding a statement to the operator's description making it clearer that a) the performance is only an estimate for the model, which is usually built on the complete data set, and b) this model is delivered at the output port for convenience.
What do you think?
Cheers,
Ingo
Hola,
That's cool; as long as people understand how the numbers are made, and even better why, they can rest easy in their beds.
PS: As food for thought, if you run this you'll see that a difference remains between the all-data performance and the average validation performance even down to leave-one-out. That difference interests me - is it actually a form of "margin of error" definition?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
<process expanded="true" height="395" width="815">
<operator activated="true" class="retrieve" compatibility="5.0.0" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="false" class="add_noise" compatibility="5.0.8" expanded="true" height="94" name="Add Noise" width="90" x="45" y="165">
<parameter key="default_attribute_noise" value="0.05"/>
<list key="noise"/>
</operator>
<operator activated="true" class="multiply" compatibility="5.0.0" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
<operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="Validation" width="90" x="380" y="165">
<parameter key="leave_one_out" value="true"/>
<process expanded="true" height="391" width="294">
<operator activated="true" class="neural_net" compatibility="5.0.0" expanded="true" height="76" name="Neural Net" width="90" x="94" y="37">
<list key="hidden_layers"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="404" width="346">
<operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="24" y="33">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Validation (2)" width="90" x="179" y="30"/>
<operator activated="false" class="log" compatibility="5.0.0" expanded="true" height="60" name="Log" width="90" x="112" y="255">
<parameter key="filename" value="C:\Dokumente und Einstellungen\ich\Desktop\Test.log"/>
<list key="log">
<parameter key="Performance" value="operator.Validation (2).value.performance"/>
<parameter key="Round" value="operator.Apply Model.value.applycount"/>
<parameter key="Average Performance" value="operator.Validation.value.performance"/>
</list>
</operator>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Validation (2)" to_port="labelled data"/>
<connect from_op="Validation (2)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="75">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="5.0.8" expanded="true" height="76" name="AllData" width="90" x="715" y="75">
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Validation" to_port="training"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="AllData" to_port="labelled data"/>
<connect from_op="AllData" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="36"/>
<portSpacing port="sink_result 2" spacing="108"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
All right then. I have just added those statements to the operator's documentation.
Cheers,
Ingo
Sorry mate, just added some thoughts for the way home... Many thanks for your forbearance.
Hi,
"PS: As food for thought, if you run this you'll see that a difference remains between the all-data performance and the average validation even down to leave-one-out. That difference interests me - is it actually a form of "margin of error" definition?"
I would assume that this is still the difference between the training error (neural networks are pretty good at overfitting), since the model was learned on the complete set and applied to that same set, and the - probably better estimated - LOO error. So there actually should be a difference - and it should become larger the more overfitting occurs, right?
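A minimal sketch of that explanation (assuming scikit-learn; slow because of the leave-one-out loop): compare the LOO estimate with the resubstitution score of the model trained on all data, once for a small net and once for a larger, more overfitting-prone one.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

for hidden in [(2,), (100, 100)]:  # small net vs. overfitting-prone net
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    loo_estimate = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    resubstitution = clf.fit(X, y).score(X, y)   # trained and scored on the complete set
    print(hidden, "LOO:", round(loo_estimate, 3), "resubstitution:", round(resubstitution, 3))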
Cheers,
Ingo
"So there actually should be a difference - and it should become larger the more overfitting occurs, right?"
And can we use that possibility in some deliciously eclectic way? Interesting - the game has changed towards not overlearning; it used to be the reverse!
Hi,
"And can we use that possibility in some deliciously eclectic way? Interesting - the game has changed towards not overlearning; it used to be the reverse!"
Maybe. Actually, I discussed the idea of controlling overfitting by means of multi-objective optimization techniques for regularized data mining and feature engineering schemes at length in my PhD. The idea is to get the whole sensitive range of good models - from those generalizing best to those clearly overfitting, which can actually be seen easily in the Pareto fronts - all in one single optimization run. This works for all regularized, optimization-based algorithms and is actually pretty cool... of course I am a bit biased.
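Very loosely sketched (this is not Ingo's actual algorithm, and the numbers are made up): treat each candidate model as a point (validation error, complexity) and keep the non-dominated ones; the resulting Pareto front runs from the best-generalizing to the clearly overfitting candidates.
def pareto_front(points):
    """Return the points not dominated in both objectives (lower is better)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# hypothetical (validation error, model complexity) pairs for illustration
candidates = [(0.08, 40), (0.06, 120), (0.05, 300), (0.07, 150), (0.05, 500)]
print(pareto_front(candidates))  # -> [(0.08, 40), (0.06, 120), (0.05, 300)]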
Cheers,
Ingo