Order of performing nested k-fold cross-validation

thomas_gadd7 · New Altair Community Member
edited November 2024 in Community Q&A

I have been looking at the following tutorial on correct model validation:

I'm looking at the section on contamination through feature selection when doing k-fold cross-validation. In the Accidental Contamination section, example 3) near the bottom suggests using nested k-fold validation to search for features, in the same way that example 2) suggests doing for the choice of hyperparameters.

My question is: is there any best practice around whether to do the nested k-fold validation for feature selection first and then use the selected features for the nested validation on the hyperparameters, or vice versa? I imagine it would be too computationally expensive to nest all three techniques within one another.


Can anyone advise on this?

Thank you

Answers

  • kypexin · New Altair Community Member

    That's a pretty great question. I would also like to see an example of a proper multi-level nested validation process for the case where all steps are needed at once (rough sketch after the list):

    • normalization
    • feature selection
    • parameter optimization
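
    Something like this is what I have in mind, sketched in scikit-learn terms since that is easiest to write down (illustrative only: made-up data and an arbitrary parameter grid; in RapidMiner it would be the corresponding operators nested inside two Cross Validation operators):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# All three steps live in one pipeline, so each is refit on every fold:
pipe = Pipeline([
    ("scale", StandardScaler()),         # normalization
    ("select", SelectKBest(f_classif)),  # feature selection
    ("clf", SVC()),                      # model
])

# Inner loop: parameter optimization (number of features kept, and C)
# with 5-fold cross-validation.
inner = GridSearchCV(
    pipe, {"select__k": [5, 10, 20], "clf__C": [0.1, 1, 10]}, cv=5
)

# Outer loop: 5-fold cross-validation around the whole search for an
# uncontaminated performance estimate. Cost: 5 outer folds x 9 grid
# points x 5 inner folds = 225 model fits (plus refits), which is why
# this gets expensive so fast.
print(cross_val_score(inner, X, y, cv=5).mean())
```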


    @mschmitz ? :)

  • Telcontar120 · New Altair Community Member

    In practice, I don't think many people are putting parameter optimization inside cross-validation. It's just too time-consuming. I'd be quite comfortable with a setup where normalization and feature selection occurred within cross-validation, and the results of that process were then fed to an optimization process where cross-validation for model training occurred inside the parameter optimization operator.
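
    For anyone who wants to see the shape of that, here is a rough scikit-learn equivalent (my own sketch with made-up data, not an official recipe; in RapidMiner you would wire up the corresponding operators):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # normalization
    ("select", SelectKBest(f_classif, k=10)),  # feature selection
    ("clf", SVC()),
])

# Step 1: honest performance estimate, with normalization and feature
# selection refit inside each fold so nothing leaks across folds.
print(cross_val_score(pipe, X, y, cv=10).mean())

# Step 2: parameter optimization with cross-validation for model
# training inside the optimization loop; only one level of nesting,
# so it stays affordable.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=10)
search.fit(X, y)
print(search.best_params_)
```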

  • Thomas_Ott · New Altair Community Member

    This is a great question, and I remember we had this discussion elsewhere in the threads here. I agree with what @Telcontar120 says.

  • kypexin · New Altair Community Member

    Thanks @Telcontar120 @Thomas_Ott 

    Though I have one really stupid question at this point, as I am a bit dumb today :)

    If we normalize or perform feature selection within k-fold x-Validation, this is done k+1 times in total, if I remember correctly from Martin's explanation somewhere else: k times (once for each fold) plus one more time for the full dataset, right? At the same time, logic tells me that on each fold we might get slightly different normalization or feature selection?

    So then, how do we pull the preprocessing model out of x-Validation in this case? Just by taking the latest one? My concern is that the same preprocessing model should also be applied to the test set, and propagated to the production process (if there is one).

  • Telcontar120 · New Altair Community Member

    Correct, with k-fold cross-validation there are k+1 runs, where the final run is on the entire dataset, and that is the model that is returned. But conceptually, cross-validation is simply a way to estimate the reliability of your results on unseen data (to avoid overfitting), and as Ingo's post has shown, when you do things like normalization and other preprocessing inside the cross-validation, you get a more realistic view of what your eventual performance will be. But when you actually go to construct your normalization model or other preprocessing models, that should be done using the entire dataset.

    Feature selection is similar, except there is no preprocessing model returned, just a smaller set of attributes that will be used in the final model. And of course the predictive model itself is returned directly from the cross-validation output (once again, the one built on the entire dataset, not any of the individual k folds).
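
    If it helps, here is the same idea sketched in scikit-learn terms (an illustration with made-up data, not RapidMiner itself): the cross-validation scores are the k estimation runs, and the final fit on the entire dataset is the "+1" run whose preprocessing and model you would carry forward to a test set or production.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # preprocessing model
    ("select", SelectKBest(f_classif, k=10)),  # feature selection
    ("clf", SVC()),                            # predictive model
])

# The k estimation runs: preprocessing is refit inside every fold, so
# each fold can indeed normalize and select slightly differently; that
# is fine, because these runs only estimate performance on unseen data.
scores = cross_val_score(pipe, X, y, cv=10)
print("estimated accuracy:", scores.mean())

# The "+1" run: refit everything on the entire dataset. This is the
# preprocessing + model combination you apply to new data in production.
final_model = pipe.fit(X, y)
print("attributes kept:", final_model.named_steps["select"].get_support().sum())
```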

    I hope this clarifies!

  • kypexin · New Altair Community Member

    Thanks @Telcontar120, you have restored my sanity :) This is the way I actually do it; I just decided to double-check because it's a dumb day, that's why :)

  • Thomas_Ott · New Altair Community Member

    @kypexin it's a complex rabbit hole, but it's exactly what @Telcontar120 said when it comes to k+1: the returned model is built on the entire dataset, and the performance is the average over each of the k folds.


    When I used to teach the RM training course, this topic (e.g. normalizing inside the X-Val) would cause my students' heads to smoke.