Order of performing nested k-fold cross-validation
Answers
-
In practice, I don't think many people put parameter optimization inside cross-validation; it's just too time-consuming. I'd be quite comfortable with a setup where normalization and feature selection occur within cross-validation, and the results of that process are then fed to an optimization process where the cross-validation for model training occurs inside the parameter optimization operator.
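As a rough illustration of the second stage of that setup (cross-validation running inside the parameter optimization), here is a minimal plain-Python sketch. It is not a RapidMiner process: the toy k-NN classifier, the synthetic data, and the names `knn_predict`, `cv_accuracy`, and `optimize_parameter` are all assumptions made up for the example, and the data is assumed to be already preprocessed.

```python
import random
import statistics

def knn_predict(train_X, train_y, row, n_neighbors):
    """Toy k-NN: majority vote among the closest training rows."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], row)),
    )
    votes = [label for _, label in ranked[:n_neighbors]]
    return max(sorted(set(votes)), key=votes.count)

def cv_accuracy(X, y, n_neighbors, folds=5):
    """Inner cross-validation: average accuracy for ONE parameter value."""
    scores = []
    for fold in range(folds):
        test_idx = set(range(fold, len(X), folds))  # every folds-th row
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
            for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return statistics.fmean(scores)

def optimize_parameter(X, y, grid):
    """Outer loop: try each candidate value and score it with the inner CV."""
    scores = {n: cv_accuracy(X, y, n) for n in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic, already-preprocessed data (assumption): two overlapping classes.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)] + \
    [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

grid = [1, 3, 5, 7]
best_n, grid_scores = optimize_parameter(X, y, grid)
print("best n_neighbors:", best_n)
print("CV accuracy per candidate:", grid_scores)
```

The cost is visible in the loop structure: every candidate in `grid` triggers a full cross-validation, so wrapping this in yet another cross-validation multiplies the number of model fits again.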
0 -
This is a great question, and I remember we had this discussion elsewhere in the threads here. I agree with what @Telcontar120 says.
0 -
Thanks @Telcontar120 @Thomas_Ott
Though I have one really stupid question at this point, as I am a bit dumb today.
If we normalize or perform feature selection within k-fold x-Validation, this is done k+1 times in total, if I remember correctly from Martin's explanation somewhere else: k times (once for each fold) plus one more time for the full dataset, right? At the same time, logic tells me that on each fold we might get a slightly different normalization or feature selection?
So how do we pull the preprocessing model out of x-Validation in this case? Just by taking the latest one? My concern is that the same preprocessing model should also be applied to a test set and propagated to the production process (if there is one).
0 -
Correct: with k-fold cross-validation there are k+1 runs, where the final run is on the entire dataset, and that is the result returned for any model. Conceptually, cross-validation is simply a way to estimate how reliable your results will be on unseen data (to avoid overfitting), and as Ingo's post has shown, when you do things like normalization and other preprocessing inside the cross-validation, you get a more realistic view of what your eventual performance would be. But when you actually go to construct your normalization model or other preprocessing models, that should be done using the entire dataset.
Feature selection is similar, only there is no preprocessing model that is returned, just a smaller set of attributes that will be used in the final model. And of course the predictive model itself is returned directly from the cross-validation output (once again, the one built on the entire dataset and not any of the individual k folds).
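To make the k+1 runs concrete, here is a small plain-Python sketch (not RapidMiner itself). Everything in it is an assumption for illustration: the synthetic data, the toy nearest-centroid classifier, and the function names. It fits a z-score normalization on each training fold only for the performance estimate, then fits the normalization and the model once more on the entire dataset, and that last pair is what you would apply to a test set or in production.

```python
import random
import statistics

def fit_normalizer(rows):
    """Learn a z-score 'preprocessing model': one (mean, std) pair per column."""
    return [(statistics.fmean(col), statistics.pstdev(col) or 1.0)
            for col in zip(*rows)]

def apply_normalizer(params, rows):
    """Apply previously learned (mean, std) pairs to new rows."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

def fit_centroids(rows, labels):
    """Toy classifier: store the mean point of each class."""
    model = {}
    for label in sorted(set(labels)):
        members = [r for r, t in zip(rows, labels) if t == label]
        model[label] = [statistics.fmean(col) for col in zip(*members)]
    return model

def predict(model, row):
    """Return the class whose centroid is closest to the row."""
    return min(model,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))

def cross_validate(X, y, k=5):
    fold_params, fold_scores = [], []
    for fold in range(k):                            # runs 1..k: estimation only
        test_idx = set(range(fold, len(X), k))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        params = fit_normalizer(train_X)             # fitted on training folds only
        model = fit_centroids(apply_normalizer(params, train_X), train_y)
        hits = sum(
            predict(model, apply_normalizer(params, [X[i]])[0]) == y[i]
            for i in test_idx
        )
        fold_params.append(params)
        fold_scores.append(hits / len(test_idx))
    final_params = fit_normalizer(X)                 # run k+1: entire dataset
    final_model = fit_centroids(apply_normalizer(final_params, X), y)
    return statistics.fmean(fold_scores), fold_params, final_params, final_model

# Synthetic data (assumption): two classes, two features on different scales.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(50, 10)] for _ in range(50)] + \
    [[rng.gauss(3, 1), rng.gauss(80, 10)] for _ in range(50)]
y = [0] * 50 + [1] * 50

estimate, fold_params, final_params, final_model = cross_validate(X, y, k=5)
print("estimated accuracy (average over folds):", estimate)
print("per-fold means of feature 1:", [p[0][0] for p in fold_params])
print("full-data mean of feature 1:", final_params[0][0])
```

The per-fold parameters differ slightly from one another, as suspected in the question, but they are only used to produce the performance estimate. The preprocessing model that gets reused is `final_params`, fitted on all rows.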
I hope this clarifies!
1 -
Thanks @Telcontar120, you have restored my sanity. This is the way I actually do it; I just decided to double-check because I'm having a dumb day.
0 -
@kypexin it's a complex rabbit hole, but it's exactly what @Telcontar120 said: when it comes to k+1, the model is built on the entire dataset, and the reported performance is the average over the k folds.
When I used to teach the RM training course, this topic (e.g. normalizing inside the X-val) would make my students' heads smoke.
1