hi @khannadh - I saw your note but mainly wanted to tag @Thomas_Ott so that it gets his attention
So your question is a good one. Your setup was almost correct, except that you need to specify the name map parameter in the Set Parameters operator:
Scott
I would be very curious what others think on this very important issue, as setups have varied over the years. @Telcontar120? @mschmitz? @yyhuang? @Pavithra_Rao?
Scott
Also, when I set the parameter according to your screenshot, I still get a warning sign, so I'm not sure whether the problem is fixed.
I've attached a screenshot.
Do you know why this is happening?
@khannadh at run time with larger data sets, this setup could become slow. I would just put the Cross Validation operator inside the Optimize Parameters operator instead of the other way around. This way, 10 folds become one parameter optimization iteration.
I tend to agree with @Thomas_Ott here. While I understand the theoretical arguments (at least on some level) in favor of the double-nesting (cross-validation inside Optimize Parameters inside cross-validation), I don't find that in practice there is a significant difference or advantage to this solution. But as Tom says, it can lead to significantly longer run times with larger data sets. I'll also point out that the double-nesting approach is not used in RapidMiner's Auto Model processes either.
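For concreteness, here is a minimal sketch of what that double-nested setup looks like, using scikit-learn as an analogy for the RapidMiner operator chain (the dataset, grid values, and fold counts are illustrative assumptions, not from this thread):

```python
# Sketch of the "double-nested" / nested cross-validation setup:
# cross-validation around a parameter optimization that is itself
# validated by cross-validation. Illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: parameter optimization with its own cross-validation.
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},
    cv=5,
)

# Outer loop: cross-validate the entire optimization procedure.
# Every outer fold reruns the full inner grid search, which is why
# run times blow up on larger data sets (roughly 10 x 5 x 4 model fits here).
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested CV accuracy: {outer_scores.mean():.3f} (+/- {outer_scores.std():.3f})")
```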
ok thanks @Thomas_Ott @Telcontar120 that was my feeling as well, but I appreciate the confirmation. So @khannadh, just to be crystal clear: the approach shown in that whitepaper is the "gold standard" but is rarely used in practice, due to the issues pointed out above.
Now to answer your questions...
- The "Set Parameters" literally takes the input parameters on the left (the gray "par" nub) and pushes them into the parameters for another operator in your process by its name. In your process, the name of the operator to which you want to push those parameters is called "Decision Tree (2)", and your Set Parameters operator is, in your process, named "Set Parameters". Hence, in the name map, I put "Set Parameters" in the left side (under "set operator name") and "Decision Tree (2)" on the right side (under "operator name"). That's what that operator does.
- Now as @Telcontar120 and @Thomas_Ott implied, none of us really do this. To be honest, that's the first time I have used "Set Parameters" in a very long time (and I'm on RapidMiner every day). The more "normal" and much simpler way to do this (and the way that I think we all do it) is simply to put Cross Validation inside "Optimize Parameters (Grid)". Done. See the sketch after this list.
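Here is that simpler setup sketched in scikit-learn terms, where GridSearchCV plays the role of "Optimize Parameters (Grid)" with a cross-validation inside it (the dataset and grid values are illustrative assumptions):

```python
# Sketch of the common setup: Cross Validation inside Optimize
# Parameters (Grid), approximated with scikit-learn's GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One 10-fold cross-validation per parameter combination; the best
# combination is picked by mean cross-validated accuracy.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```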
The only other thing that many of us do, to make sure the performance is a true measure, is an initial split of the data so that you measure final performance against an unseen "testing" set, like the sketch below.
I usually do a 70/30 split, but this often depends on who's doing it and what the data set is like.
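A minimal sketch of that hold-out idea, again in scikit-learn terms (the 70/30 ratio, dataset, and grid are illustrative choices):

```python
# Sketch of an initial 70/30 split: optimize on the training portion,
# then score once on the untouched 30% "testing" set.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},
    cv=10,
)
grid.fit(X_train, y_train)

# The 30% was never seen during optimization, so this is an honest
# estimate of performance on unseen data.
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```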
Good luck!
Scott
There are big differences in how the Split Validation and Cross Validation operators work, but the intent is the same: train, test, and measure the performance of a model. The Cross Validation operator gives a more honest estimate of how the model would perform on unseen data sets. This is why in the accuracy measure for a cross-validated model you might see 70.00% (+/- 5%). The +/- 5% is essentially one standard deviation around the average accuracy of 70%.
Go check out Ingo's paper on model validation to learn more: https://rapidminer.com/resource/correct-model-validation/
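To see where a readout like 70.00% (+/- 5%) comes from, here is a small sketch (dataset and model are illustrative): cross-validation produces one accuracy per fold, and the reported figure is the mean of those scores plus or minus their standard deviation.

```python
# Sketch: one accuracy score per fold, reported as mean +/- std dev.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```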