different values for regressionPerformance for the same data

Legacy User · December 2009

Hello,

My problem is that I get different RegressionPerformance values for the same attribute.
I used model 1 (with FeatureSelection) and model 2 (without FeatureSelection, only with an AttributeFilter).

Attribute
Model 1: att3 root_mean_squared_error 0.334 squared_correlation 10.651
Model 2: att3 root_mean_squared_error 0.326 squared_correlation 11.189
???

The same attribute (e.g. att3) has different RegressionPerformance values in the two models. Can anyone tell me why?

Model 1

<operator name="Root" class="Process" expanded="yes">
  <operator name="ExampleSetGenerator" class="ExampleSetGenerator" breakpoints="after">
    <parameter key="target_function" value="sum"/>
  </operator>
  <operator name="FS" class="FeatureSelection" expanded="yes">
    <parameter key="user_result_individual_selection" value="true"/>
    <parameter key="keep_best" value="64"/>
    <parameter key="maximum_number_of_generations" value="1"/>
    <operator name="BootstrappingValidation" class="BootstrappingValidation" expanded="yes">
      <parameter key="keep_example_set" value="true"/>
      <parameter key="create_complete_model" value="true"/>
      <operator name="LinearRegression" class="LinearRegression">
        <parameter key="feature_selection" value="none"/>
      </operator>
      <operator name="ApplierChain" class="OperatorChain" expanded="yes">
        <operator name="Applier" class="ModelApplier">
          <parameter key="keep_model" value="true"/>
          <list key="application_parameters">
          </list>
        </operator>
        <operator name="RegressionPerformance" class="RegressionPerformance">
          <parameter key="main_criterion" value="squared_correlation"/>
          <parameter key="root_mean_squared_error" value="true"/>
          <parameter key="squared_correlation" value="true"/>
        </operator>
      </operator>
    </operator>
  </operator>
</operator>

Model 2

<operator name="Root" class="Process" expanded="yes">
  <operator name="Daten laden und vorbereiten" class="OperatorChain" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
      <parameter key="target_function" value="sum"/>
    </operator>
  </operator>
  <operator name="Attribute identifizieren, Ranking, Correalation" class="OperatorChain" expanded="yes">
    <operator name="AttributeFilter" class="AttributeFilter">
      <parameter key="condition_class" value="attribute_name_filter"/>
      <parameter key="parameter_string" value="att3"/>
    </operator>
    <operator name="BootstrappingValidation" class="BootstrappingValidation" expanded="yes">
      <parameter key="keep_example_set" value="true"/>
      <parameter key="create_complete_model" value="true"/>
      <operator name="LinearRegression" class="LinearRegression">
        <parameter key="feature_selection" value="none"/>
      </operator>
      <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="ModelApplier (2)" class="ModelApplier">
          <list key="application_parameters">
          </list>
        </operator>
        <operator name="RegressionPerformance" class="RegressionPerformance">
          <parameter key="root_mean_squared_error" value="true"/>
          <parameter key="absolute_error" value="true"/>
          <parameter key="relative_error" value="true"/>
          <parameter key="correlation" value="true"/>
          <parameter key="squared_correlation" value="true"/>
          <parameter key="skip_undefined_labels" value="false"/>
          <parameter key="use_example_weights" value="false"/>
        </operator>
      </operator>
    </operator>
  </operator>
  <operator name="ModelApplier" class="ModelApplier">
    <parameter key="keep_model" value="true"/>
    <list key="application_parameters">
    </list>
  </operator>
</operator>

land · December 2009

Hi,
The quick answer is: because you have two different processes. Even a different order in which the random numbers are consumed can affect the measured performance. You could use local_random_seeds to avoid this.

Greetings,
Sebastian
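
To illustrate the point about the order in which random numbers are consumed, here is a minimal Python sketch. It is not RapidMiner code: the seed values, the helper `bootstrap_indices`, and the "earlier step" that draws three numbers are all hypothetical and only stand in for whatever operators run before the validation in each process.

```python
import random


def bootstrap_indices(rng, n):
    """Draw n example indices with replacement, as a bootstrap sample would."""
    return [rng.randrange(n) for _ in range(n)]


# Process A: the validation is the first consumer of the shared generator.
shared_a = random.Random(2009)
sample_a = bootstrap_indices(shared_a, 10)

# Process B: same global seed, but a (hypothetical) earlier step
# consumes three random numbers before the validation runs.
shared_b = random.Random(2009)
_ = [shared_b.random() for _ in range(3)]
sample_b = bootstrap_indices(shared_b, 10)

# With a local seed the validation gets its own generator, so the
# resample no longer depends on what ran before it.
local_a = bootstrap_indices(random.Random(1), 10)
local_b = bootstrap_indices(random.Random(1), 10)

print("shared generator, process A:", sample_a)
print("shared generator, process B:", sample_b)
print("local seed, both processes: ", local_a, local_b)
```

With the shared generator the two processes draw different resamples even though the global seed is identical; with a local seed both draw the same one.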

Legacy User · December 2009

Dear Sebastian,

I don't know what you mean by local_random_seeds.
The only thing I added in model 1 is the FeatureSelection. As I understand it, this makes it possible to test all attributes individually and in combination, to find the best fit
with a linear model. But that is not a random process in itself.
I want to find out which attributes are best for predicting the label. For this I rely on the performance criteria, such as squared_correlation and root_mean_squared_error.

best regards

Angela

haddock · December 2009

I don't know what you mean by local_random_seeds.

Have you really not thought of searching this forum, say on "local_random"?

Legacy User · December 2009

haddock wrote:

Have you really not thought of searching this forum, say on "local_random"?

I know what local_random_seeds is!
That is also one of the features that makes RapidMiner so special.
But please read my entire question.

Even a random process should not change the quality measures (the RegressionPerformance criteria) for each attribute.
I therefore assume that I cannot compare the RegressionPerformance criteria for specific attributes across two different models.

Best regards

haddock · December 2009

But please read my entire question

I have, and Seb has answered it, and....?

land · December 2009

Hi Angela,
Of course a random sampling of examples affects the measured quality, and the BootstrappingValidation operators do exactly such a random sampling. Without the same random number sequence, it is not guaranteed that the same examples are selected. For example, if one example that can be matched perfectly is not selected, but an outlier is selected twice, this will affect the performance heavily.
I would recommend setting a local random seed on your BootstrappingValidation operators; this should do the trick.

Greetings,
Sebastian
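
The effect described above can be reproduced outside RapidMiner with a small Python sketch. Everything in it is an assumption made for illustration: the synthetic data with one outlier, the helpers `fit_line` and `bootstrap_rmse`, and the choice to train on the resample and test on the examples that were not drawn. It does not claim to mirror how BootstrappingValidation is implemented.

```python
import math
import random


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_rmse(data, seed):
    """Train on one bootstrap resample, test on the examples not drawn."""
    rng = random.Random(seed)  # plays the role of a local random seed
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]
    chosen = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in chosen]
    slope, intercept = fit_line(train)
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in test]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data (assumed): a noisy line plus one deliberate outlier.
gen = random.Random(0)
data = []
for _ in range(50):
    x = gen.uniform(-10.0, 10.0)
    data.append((x, 2.0 * x + gen.gauss(0.0, 0.3)))
data.append((0.0, 25.0))

estimates = {seed: bootstrap_rmse(data, seed) for seed in (1, 10, 100)}
for seed, rmse in estimates.items():
    print(f"seed={seed:>3}  rmse={rmse:.3f}")
```

Repeating a call with the same seed returns the same estimate, while different seeds give different estimates, largely depending on whether the outlier lands in the training resample, the test part, or both.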

Angela · December 2009

Hi Sebastian,

Many thanks for this answer. I have changed the local_random_seed from -1 to other values (1, 10, 100), but I get the same values for
squared_correlation for the attributes.

However, I found another way to get the correct squared_correlation from the importance values.
Many thanks for your help.

Angela
