Feature Selection
Hi everyone,
It is more than clear that feature selection should take place within the cross-validation operator, to avoid leaking the labels, as happens when it is placed outside and prior to the CV operator. My question concerns the fact that the features selected by, for example, mRMR may differ for each CV fold: which model is the one that I get at the output?
Thanks in advance
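To see the leakage point concretely, here is a minimal scikit-learn sketch (not RapidMiner; SelectKBest stands in for mRMR, which scikit-learn does not ship). Putting the selector inside a Pipeline means it is re-fit on each training fold only, which is what happens when feature selection sits inside the Cross Validation operator:

```python
# Minimal sketch (scikit-learn, illustrative only; SelectKBest stands in
# for mRMR). Inside the Pipeline, selection is fit per training fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=200,
                           n_informative=5, random_state=0)

# Correct: selection happens inside each CV fold
pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression(max_iter=1000))])
print("inside CV:", cross_val_score(pipe, X, y, cv=5).mean())

# Leaky: selection sees all labels before CV, typically inflating the estimate
X_leaky = SelectKBest(f_classif, k=5).fit_transform(X, y)
print("outside CV:", cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean())
```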
Hello @npapan69
Placing the feature selection technique inside the cross-validation operator generalizes the results by reducing bias. And yes, as you mentioned, there might be 5 different models (in the case of 5 folds) with 5 different feature sets built during CV, because you are using feature selection inside the cross-validation operator. The "mod" output of cross-validation in RapidMiner gives you a model trained on the whole input dataset, which means the model you get might be different from all 5 models created during cross-validation.
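A sketch of the same behavior in scikit-learn terms (illustrative only; SelectKBest again stands in for mRMR): cross_validate with return_estimator=True exposes each fold's fitted selector, so you can watch the per-fold feature sets differ.

```python
# Illustrative sketch: inspect the (possibly different) feature sets
# chosen on each of the 5 folds.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression(max_iter=1000))])

result = cross_validate(pipe, X, y, cv=5, return_estimator=True)
for i, est in enumerate(result["estimator"]):
    picked = est.named_steps["select"].get_support(indices=True)
    print(f"fold {i}: selected feature indices {picked}")
```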
Hello, again!
Stop! Stop! Stop! Don't make that answer the right one! (My pride says "delete your answer", but my OCD says "leave it there").
TIL that it is more common to put feature selection inside the cross-validation process, because otherwise it leads to biased results. Thanks to @varunm1 for the several links he sent me. I actually got confused (too many hours programming stuff, you know), but this article made it clear for me: https://rapidminer.com/blog/learn-right-way-validate-models-part-4-accidental-contamination/.
Despite my slip (and now that I understand the question), I can focus on this:
"Since the features selected by, for example, mRMR may differ for each CV fold, which model is the one that I get at the output?"
Let's see what the Cross Validation documentation says:
"Also the number of iterations that will take place is the same as the number of folds. If the model output port is connected, the Training subprocess is repeated one more time with all Examples to build the final model."
So, the correct answer is: you get the model trained with all the data (not with any specific fold), but only if you connect the mod port. That final model uses the features found when the feature selection step runs on the full dataset.
All the best,
Rodrigo.
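That "one more time with all Examples" step can be sketched in scikit-learn as a plain refit on the full dataset (illustrative only; RapidMiner's mod port has no literal scikit-learn equivalent, and SelectKBest again stands in for mRMR):

```python
# Sketch: CV estimates the performance of the process; the deployable
# model is then the same pipeline re-fit on ALL the data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression(max_iter=1000))])

print("CV estimate:", cross_val_score(pipe, X, y, cv=5).mean())

final_model = pipe.fit(X, y)   # selection re-run on the full dataset
print("final features:",
      final_model.named_steps["select"].get_support(indices=True))
```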
I know this has been sorted out already, but let me dig the confusion out...
I think we have two very different problems here:
1. evaluating a process to arrive at the best model for the data;
2. evaluating the model to be later deployed.
I think the selected solution is looking at #1, which aims to evaluate the process capable of generating a deployable model. We believe that the resulting model will perform according to the cross-validation estimate, and so, quite correctly, feature engineering should be inside the cross-validation loop. In fact, while experimenting we are likely to improve the model or the selection of features in response to our validation results, so we are actively using our knowledge of the validation results to make improvements (oh no!).
However, at some point we will create a model using "all" our data, and feature engineering will be conducted on "all" the data. So what is the performance of that specific model, which uses those specific features, especially given that we interfered with both the features and the model along the way?
Usually we reserve yet another data partition for exactly that purpose and call it "honest testing"; it is no longer part of the optimization/improvement loop. This means "all" is a relative term that excludes the honest-testing partition. Also, if we are dealing with millions of data points, I question the sanity of using all the data to train a model; if we instead select a good representative sample for model development, we are left with a very large partition for multiple-sample testing, which gives a better estimate of performance for this particular model with those specific features.
Confusing? -- Jacob
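A minimal sketch of Jacob's honest-testing setup, assuming scikit-learn and the same stand-in selector as above: the holdout partition stays outside the entire selection/validation loop and is scored exactly once.

```python
# Sketch of "honest testing": CV on the development partition evaluates
# the PROCESS; the holdout partition evaluates THIS final model, once.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           random_state=0)

# Holdout partition, untouched during model development
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

pipe = Pipeline([("select", SelectKBest(f_classif, k=5)),
                 ("clf", LogisticRegression(max_iter=1000))])

# 1) evaluate the process on the development partition
print("CV estimate:", cross_val_score(pipe, X_dev, y_dev, cv=5).mean())

# 2) build the deployable model on all development data,
#    then score it once on the holdout
pipe.fit(X_dev, y_dev)
print("honest test:", pipe.score(X_test, y_test))
```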
No, feature selection should be done before the cross-validation process, not inside it. What you are trying to do will lead to certain example subsets having different columns, and to a model that is both unpredictable and poorly trained.
Again, do you mind sharing your XML so we can see what is happening?
All the best,
Rodrigo.