"Averaging cross-validation results"
_paul_
New Altair Community Member
Hi,
I've a general and a RapidMiner-specific question concerning the cross-validation.
In the meta sample "07_EvolutionaryParameterOptimization" you are performing an evolutionary
parameter optimization for LibSVMLearner based on the performance results from a cross-validation.
Between the EvolutionaryParameterOptimization and XValidation operators you are using the operator
"IteratingPerformanceAverage". Is it recommended to always use it in order to get less biased results?
If so, what is a typical value for the parameter "number_of_validations"?
I would expect that the "IteratingPerformanceAverage" operator modifies the random seed. In the sample mentioned
above it's not clear to me how this happens. The operator "Process" uses the fixed value of "2001" for the parameter
"random_seed". The operator "XValidation" uses "-1" for "local_random_seed", i.e. the global settings. So, it looks to
me that for all iterations of the cross-validation the same seed is used, namely 2001. Wouldn't it make more sense to
use "-1" for "random_seed" in "Process" so that each validation gets a different seed?
Regards,
Paul
Answers
Hello Paul
- The IteratingPerformanceAverage operator averages the PerformanceVectors (the output of the cross-validation) over several runs. Yes, this is in general a recommended strategy. Kohavi suggests repeating a 10-fold cross-validation 6-10 times.
- The global random generator using seed 2001 is initialized every time you start a process, not every time an operator is executed. Hence I can assert that XValidation splits the data in a different way every time the operator is executed (please note the difference between a single operator and the whole process). If you set the mentioned parameter to a fixed value other than -1, it would initialize the random generator every time the operator is executed and hence always split the data the same way. I hope it is clear now.
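The averaging idea can be sketched outside RapidMiner. The following is a minimal Python illustration, not RapidMiner's implementation: the function names and the toy majority-class "learner" are assumptions made up for this example. It repeats a k-fold cross-validation several times on one generator that is seeded once, then averages the per-run results.

```python
import random
from statistics import mean

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 with rng and deal the indices into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_label(labels):
    """Toy 'learner' (hypothetical): predicts the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)

def cross_validate(labels, k, rng):
    """Accuracy of the toy learner, averaged over the k folds of one split."""
    scores = []
    for test in k_fold_indices(len(labels), k, rng):
        test_set = set(test)
        train = [labels[i] for i in range(len(labels)) if i not in test_set]
        guess = majority_label(train)
        scores.append(mean(1.0 if labels[i] == guess else 0.0 for i in test))
    return mean(scores)

def repeated_cross_validation(labels, k=10, repeats=10, seed=2001):
    """Run k-fold CV `repeats` times on one generator and average the runs."""
    rng = random.Random(seed)  # seeded once, then shared by all repetitions
    runs = [cross_validate(labels, k, rng) for _ in range(repeats)]
    return mean(runs), runs
```

Because the generator is seeded once and then consumed, each repetition sees a different split, yet calling `repeated_cross_validation` again with the same seed returns the same average.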
Steffen
PS: I guess I have found the first topic for the wiki ;D
Hi Steffen,
"Yes, this is in general a recommended strategy. Kohavi suggests to repeat a 10-fold Crossvalidation 6-10 times."
Would you have a reference (paper/book) where I could find Kohavi's suggestion?
"The global random generator using seed 2001 is initialized every time you start a process."
Maybe I got it wrong, but I think you meant "-1" here and not "2001", right? To my understanding you would
always get the same pseudo-random numbers when you use a fixed value != -1. Using -1, on the other hand,
might be a problem when you want reproducible results, since different seeds are "always" generated.
I think that the most suitable approach combined with the IteratingPerformanceAverage operator would be a mix
of both seed specifications: RapidMiner should perform the cross-validation 6-10 times with different seeds
that are nevertheless specified statically. The results would then be reproducible each time you run your process,
and the validation result would be an average over multiple seeds rather than being biased toward
one specific seed.
Is there a way to tell RapidMiner to perform a cross-validation with a set of manually pre-defined seeds?
Regards,
Paul
Hello Paul
First of all: -1 means that you use the global random generator, which is (as specified in the preferences) initialized with 2001.
Then:
The global random generator is initialized with 2001 every time a process is executed (by clicking the arrow button). The local generators, on the other hand, are initialized with the specified seed (!= -1) every time the operator on which this seed has been specified is executed. Hence the results are always reproducible.
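The difference between the two seeding modes can be mimicked in a few lines of Python. This is only a sketch of the behaviour described above, not RapidMiner code: `run_process`, `split_order` and `GLOBAL_SEED` are hypothetical names, and Python's generator will not reproduce RapidMiner's actual splits.

```python
import random

GLOBAL_SEED = 2001  # stands in for "random_seed" on the Process operator

def split_order(n, rng):
    """Return the shuffled example order a validation would split on."""
    order = list(range(n))
    rng.shuffle(order)
    return order

def run_process(executions, local_seed=-1, n=20):
    """Simulate one process run that executes the validation several times."""
    global_rng = random.Random(GLOBAL_SEED)  # re-initialized on every process start
    orders = []
    for _ in range(executions):
        if local_seed == -1:
            rng = global_rng                 # keep consuming the global stream
        else:
            rng = random.Random(local_seed)  # re-seeded on every operator execution
        orders.append(split_order(n, rng))
    return orders
```

With `local_seed=-1`, the splits differ between executions inside one run, but two calls to `run_process` return identical lists. With a fixed `local_seed`, every execution produces the same split.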
To use self-specified seeds for IteratingPerformanceAverage, you can type
%{a}
as argument (RapidMiner macros, a powerful thing, see the tutorial.pdf for more details), which replaces the seed with the number of the current iteration (1, 2, 3, ...).
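The effect of using the iteration number as the seed can be sketched in Python as follows. This is an illustration under assumptions, not RapidMiner's macro mechanism: `average_over_seeds` and `toy_validation` are hypothetical names, and `toy_validation` is only a placeholder for one cross-validation run.

```python
import random
from statistics import mean

def average_over_seeds(evaluate, seeds):
    """Call evaluate(rng) once per seed and average the returned scores."""
    scores = [evaluate(random.Random(seed)) for seed in seeds]
    return mean(scores), scores

def toy_validation(rng, n=100, k=10):
    """Placeholder for one validation run: share of even indices in the first fold."""
    order = list(range(n))
    rng.shuffle(order)
    test_fold = order[: n // k]
    return sum(1 for i in test_fold if i % 2 == 0) / len(test_fold)

# Iteration numbers 1..10 as seeds, analogous to the iteration counter;
# any explicit list of seeds, e.g. [11, 42, 2001], works the same way.
avg, scores = average_over_seeds(toy_validation, range(1, 11))
```

Each iteration gets its own statically defined seed, so the average covers several different splits while the whole run stays reproducible.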
I suggest you keep playing with the RapidMiner example processes to see what I mean. I hope I didn't increase your confusion.
Regarding Kohavi: Here is the link to his Ph.D. thesis (http://ai.stanford.edu/~ronnyk/teza.pdf), where you can find a detailed discussion of the issue of validation. Long text, but fun to read.
Hope this was helpful
Steffen
Hi Steffen,
thank you for your help.
What I meant by "not reproducible results" was that using "-1" as both global and local seed would always
yield different random numbers, because the seed is then taken from the system time, which usually changes
between executions of a process.
Paul