Feature Selection

User: "npapan69"
New Altair Community Member
Updated by Jocelyn
Hi everyone,
It is clear that feature selection should take place inside the Cross Validation operator: placing it outside, before the CV operator, leaks label information. My question is this: since the features selected in each CV fold (by mRMR, for example) may differ, which model is the one that I get on the output?
Thanks in advance

    User: "varunm1"
    New Altair Community Member
    Accepted Answer
    Updated by varunm1
    Hello @npapan69

    Placing feature selection inside the Cross Validation operator reduces bias in the performance estimate, so the results generalize better. Yes, as you mentioned, there may be 5 different models (in the case of 5 folds) built on 5 different feature sets, because feature selection runs inside each fold. The "mod" output of Cross Validation in RapidMiner gives you a model trained on the whole input dataset, which means the model you get may differ from all 5 models created during cross-validation.
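A minimal plain-Python sketch of that behaviour, in case it helps. It is not RapidMiner code: the helper names (`top_k_features`, `train`, `cross_validate`), the toy data, the nearest-centroid learner and the simple class-mean ranking (a stand-in for mRMR) are all assumptions made for illustration. It only shows that each fold can pick its own feature set, and that the model handed back is trained once more on all rows.

```python
import random

def top_k_features(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)| and keep the top k.
    A simple univariate stand-in for mRMR, for illustration only."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    def gap(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
    return sorted(sorted(range(len(X[0])), key=gap, reverse=True)[:k])

def train(X, y, k):
    """Hypothetical 'Training subprocess': select features, then fit class centroids."""
    feats = top_k_features(X, y, k)
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return {"features": feats, "centroids": centroids, "n_train": len(X)}

def predict(model, row):
    """Assign the class whose centroid is closest on the selected features."""
    def dist(label):
        return sum((row[j] - c) ** 2
                   for j, c in zip(model["features"], model["centroids"][label]))
    return min((0, 1), key=dist)

def cross_validate(X, y, n_folds, k):
    """Return (accuracy, feature set chosen in each fold, model trained on all rows)."""
    idx = list(range(len(X)))
    fold_features, correct = [], 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train_idx = [i for i in idx if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx], k)
        fold_features.append(model["features"])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    final_model = train(X, y, k)  # one extra training run on all examples
    return correct / len(X), fold_features, final_model

# Toy data (an assumption): features 0 and 1 carry signal, the other 18 are noise.
random.seed(42)
n_rows, n_cols = 60, 20
y = [i % 2 for i in range(n_rows)]
X = [[random.gauss(0, 1) + (2.0 * label if j < 2 else 0.0) for j in range(n_cols)]
     for label in y]

acc, fold_features, final_model = cross_validate(X, y, n_folds=5, k=3)
for f, feats in enumerate(fold_features):
    print("fold", f, "selected features", feats)
print("final model features", final_model["features"],
      "trained on", final_model["n_train"], "rows")
print("cross-validated accuracy", acc)
```

The accuracy describes the whole procedure (selection plus learner), while the returned model is a separate, extra fit on all the data.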
    User: "rfuentealba"
    New Altair Community Member
    Accepted Answer
    Updated by rfuentealba
    Hello, again!

    Stop! Stop! Stop! Don't make that answer the right one! (My pride says "delete your answer", but my OCD says "leave it there").

    TIL that feature selection should go inside the cross-validation process, because doing it outside leads to biased (overly optimistic) results. Thanks to @varunm1 for the several links he sent me. I had actually gotten confused (too many hours programming stuff, you know), but this article cleared it up for me: https://rapidminer.com/blog/learn-right-way-validate-models-part-4-accidental-contamination/.
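To make the bias concrete, here is a rough plain-Python sketch (not taken from the linked article, and not RapidMiner code). Everything in it is an assumption for illustration: the `select` and `cv_accuracy` helpers, the class-mean ranking used instead of mRMR, the nearest-centroid learner, and the data, which is pure noise with labels unrelated to the features. An honest estimate should therefore sit near chance, while selecting features before the split tends to look much better than that.

```python
import random

def select(X, y, k):
    """Keep the k features with the largest class-mean gap (stand-in for mRMR)."""
    pos = [r for r, lab in zip(X, y) if lab == 1]
    neg = [r for r, lab in zip(X, y) if lab == 0]
    def gap(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def cv_accuracy(X, y, n_folds, k, select_inside):
    """k-fold accuracy of a nearest-centroid classifier.
    select_inside=True  -> features are chosen from each training fold only.
    select_inside=False -> features are chosen once from ALL rows (label leakage)."""
    idx = list(range(len(X)))
    leaked = None if select_inside else select(X, y, k)  # has seen the test labels
    correct = 0
    for f in range(n_folds):
        test = set(idx[f::n_folds])
        tr = [i for i in idx if i not in test]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        feats = select(Xtr, ytr, k) if select_inside else leaked
        cent = {}
        for lab in (0, 1):
            rows = [r for r, l in zip(Xtr, ytr) if l == lab]
            cent[lab] = [sum(r[j] for r in rows) / len(rows) for j in feats]
        for i in test:
            def dist(lab):
                return sum((X[i][j] - c) ** 2 for j, c in zip(feats, cent[lab]))
            correct += min((0, 1), key=dist) == y[i]
    return correct / len(X)

# Pure noise: 40 rows, 1000 features, labels that carry no information about X.
random.seed(0)
n, p = 40, 1000
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

leaky = cv_accuracy(X, y, n_folds=5, k=10, select_inside=False)
honest = cv_accuracy(X, y, n_folds=5, k=10, select_inside=True)
print("selection outside CV:", leaky)
print("selection inside CV: ", honest)
```

The only difference between the two runs is where `select` is called, which is the whole point of putting feature selection inside the validation.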

    Despite my lapse (and now that I understand the question), I can focus on this:
    My question is in regard to the fact that for each CV fold maybe the selected features from mRMR, for example, will differ which model is the one that I get on the output?
    Let's see:
    • On each cross validation fold, the selected features will differ.
    • In the RapidMiner documentation for the Cross Validation operator, it says:
    Also the number of iterations that will take place is the same as the number of folds. If the model output port is connected, the Training subprocess is repeated one more time with all Examples to build the final model.
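    So, the correct answer is: you get the model trained on all the data (not on a specific fold), but only if you connect the mod port to something. Since feature selection sits inside the Training subprocess, it is repeated on all the data as well, so the final model uses the features selected on the full dataset, which may differ from the set chosen in any single fold.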

    All the best,

    Rodrigo.