Altair RISE

A program to recognize and reward our most engaged community members

Nominate Yourself Now!

Confused by the numerical XValidation output

Hi,

Here is my question this time: why the RMS printed by the XValidation decreases with # of validations?

Here is a simple example:

Data set:

X,Y
0, 0.18224201
1, 2.002307783
2, 4.187028114
...
49, 98.21944595

(this is simply Y = 2*X + rand() - 0.5)

Standard XVal experiment:

<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="H:\tmp\lin.aml"/>
</operator>
<operator name="XValidation" class="XValidation" expanded="yes">
<parameter key="create_complete_model" value="true"/>
<parameter key="keep_example_set" value="true"/>
<parameter key="number_of_validations" value="60"/>
<parameter key="sampling_type" value="shuffled sampling"/>
<operator name="LinearRegression" class="LinearRegression">
<parameter key="feature_selection" value="none"/>
<parameter key="keep_example_set" value="true"/>
</operator>
<operator name="OperatorChain" class="OperatorChain" expanded="yes">
<operator name="ModelApplier" class="ModelApplier">
<list key="application_parameters">
</list>
</operator>
<operator name="Performance" class="Performance">
</operator>
</operator>
</operator>
</operator>

When i increase the number_of_validations, here is what happens:

no_of_val rms_error

10 0.271 +- 0.040
20 0.258 +- 0.087
30 0.248 +- 0.117
40 0.252 +- 0.122
50 0.239 +- 0.140

I would expect, with # of validations, the error should remain about the same (because it's determined by the rand() ) and its uncertainty decrease?

Thanks!

Find more posts tagged with

AI Studio

ETL + Data Prep

Accepted answers

All comments

land

Hi,
increasing the number of validations means, that you divide your dataset into more parts. Since only one part is used for testing and the regression is learned on all other parts, this means you increase the total number of training examples. And hence the dependency is linearly constructed the linear regression captures it more accurate and the error is reduced on the test set.
(This depends on the property of the linear regression to converge with an infinite number of examples constructed with linear dependency into this dependency, if only random error is added.)
But since the test sets now decrease in size, there are more sets consisting of examples with very bad examples, causing a high error and more with very easy examples with low error. So the variance of the error increases, thats why the standard deviation of the RMS increases.

Greetings,
Sebastian