hi @khannadh - I saw your note but mainly wanted to tag @Thomas_Ott so that it gets his attention
So your question is a good one. Your setup was almost correct, except that you need to specify the name map parameter in the Set Parameters operator:
Scott
I would be very curious what others think on this very important issue, as setups have varied over the years. @Telcontar120? @mschmitz? @yyhuang? @Pavithra_Rao?
Scott
Also, when I set the parameter according to your screenshot, I still get a warning sign, so I'm not sure whether the problem is fixed.
I've attached a screenshot.
Do you know why this is happening?
@khannadh at run time with larger data sets, this setup could become slow. I would just put the Cross Validation operator inside the Optimize Parameters operator instead of the other way around. This way, 10 folds become one parameter optimization iteration.
I tend to agree with @Thomas_Ott here. While I understand the theoretical arguments (at least on some level) in favor of the double-nesting (cross-validation inside Optimize Parameters inside cross-validation), I don't find that in practice there is a significant difference or advantage to this solution. But as Tom says, it can lead to significantly longer run times with larger data sets. I'll also point out that the double-nesting approach is not used in RapidMiner's Auto Model processes either.
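For concreteness, here is a minimal sketch of what that double-nested setup looks like, using scikit-learn as an analogy for the RapidMiner operator chain (the dataset, grid values, and fold counts are illustrative assumptions, not from this thread):

```python
# Sketch of the "double-nested" / nested cross-validation setup:
# cross-validation around a parameter optimization that is itself
# validated by cross-validation. Illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: parameter optimization with its own cross-validation.
inner = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},
    cv=5,
)

# Outer loop: cross-validate the entire optimization procedure.
# Every outer fold reruns the full inner grid search, which is why
# run times blow up on larger data sets (roughly 10 x 5 x 4 model fits here).
outer_scores = cross_val_score(inner, X, y, cv=10)
print(f"nested CV accuracy: {outer_scores.mean():.3f} (+/- {outer_scores.std():.3f})")
```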
ok thanks @Thomas_Ott @Telcontar120 that was my feeling as well, but I appreciate the confirmation. So @khannadh, just to be crystal clear: the approach shown in that whitepaper is the "gold standard" but is rarely used in practice, due to the issues pointed out above.
Now to answer your questions...
- The "Set Parameters" literally takes the input parameters on the left (the gray "par" nub) and pushes them into the parameters for another operator in your process by its name. In your process, the name of the operator to which you want to push those parameters is called "Decision Tree (2)", and your Set Parameters operator is, in your process, named "Set Parameters". Hence, in the name map, I put "Set Parameters" in the left side (under "set operator name") and "Decision Tree (2)" on the right side (under "operator name"). That's what that operator does.
- Now as @Telcontar120 and @Thomas_Ott implied, none of us really do this. To be honest, that's the first time I have used "Set Parameters" in a very long time (and I'm on RapidMiner every day). The more "normal" and much simpler way to do this (and the way that I think we all do it) is simply to put Cross Validation inside "Optimize Parameters (Grid)". Done. See the sketch after this list.
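Here is that simpler setup sketched in scikit-learn terms, where GridSearchCV plays the role of "Optimize Parameters (Grid)" with a cross-validation inside it (the dataset and grid values are illustrative assumptions):

```python
# Sketch of the common setup: Cross Validation inside Optimize
# Parameters (Grid), approximated with scikit-learn's GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One 10-fold cross-validation per parameter combination; the best
# combination is picked by mean cross-validated accuracy.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```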
The only other thing that many of us do, to make sure the performance is a true measure, is an initial split of the data so that you measure final performance against an unseen "testing" set, like the sketch below.
I usually do a 70/30 split, but this often depends on who's doing it and what the data set is like.
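A minimal sketch of that hold-out idea, again in scikit-learn terms (the 70/30 ratio, dataset, and grid are illustrative choices):

```python
# Sketch of an initial 70/30 split: optimize on the training portion,
# then score once on the untouched 30% "testing" set.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},
    cv=10,
)
grid.fit(X_train, y_train)

# The 30% was never seen during optimization, so this is an honest
# estimate of performance on unseen data.
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```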
Good luck!
Scott
There are big differences in how the Split Validation and Cross Validation operators work, but the intent is the same: train, test, and measure the performance of a model. The Cross Validation operator gives a more honest estimate of how the model would perform on unseen data sets. This is why in the accuracy measure for a cross-validated model you might see 70.00% (+/- 5%). The +/- 5% is essentially one standard deviation around the average accuracy of 70%.
Go check out Ingo's paper on model validation to learn more: https://rapidminer.com/resource/correct-model-validation/
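To see where a readout like 70.00% (+/- 5%) comes from, here is a small sketch (dataset and model are illustrative): cross-validation produces one accuracy per fold, and the reported figure is the mean of those scores plus or minus their standard deviation.

```python
# Sketch: one accuracy score per fold, reported as mean +/- std dev.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
```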