Best method for validating results from a feature selection?
Roberto
New Altair Community Member
Hi all,
So here's my question. I have run forward and backward feature selection algorithms to strip a dataset of 27,580 attributes down to ~100 that look to be able to classify my data into 2 categories very well. These selections were wrapped in a WrapperXValidation operator, so I have an estimate of their performance. I now want to test the predictive power of these features, but I do not have a test set at my disposal to do so. I have been creating a table with only the ~100 features selected by the selection processes and running a simple XValidation on that data, using a leave-one-out strategy. A statistician told me I should do a 70/30 split on my data and cross-validate that way, but that really limits the number of samples I can use for the training/test sets (I only have 40 samples). What is the best strategy for cross-validating a predictive signature without a true test set?
Here's the basic methodology I went through (a rough sketch of the pipeline follows the list):
1) Extract features from the dataset using forward selection within a WrapperXValidation (leave-one-out strategy).
2) Create a new example set based on the features selected in step 1, then run a backward selection on that subtable, wrapped within a WrapperXValidation (leave-one-out strategy).
3) Create the final example set based on the features selected in steps 1 and 2, then run an SVM wrapped in an XValidation operator (leave-one-out strategy).
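To make those three steps concrete, here is a minimal scikit-learn sketch of the same idea. This is not the actual RapidMiner process: the synthetic data, the feature counts, and SequentialFeatureSelector are assumptions chosen purely for illustration (the real table has 27,580 attributes and ~40 samples).

```python
# Rough scikit-learn analogue of the three steps above (not the actual
# RapidMiner process). Data and feature counts are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Stand-in for the real example set
X, y = make_classification(n_samples=40, n_features=50, n_informative=5,
                           random_state=0)
svm = SVC(kernel="linear")
loo = LeaveOneOut()

# Step 1: forward selection, wrapped in leave-one-out cross-validation
forward = SequentialFeatureSelector(svm, direction="forward",
                                    n_features_to_select=10, cv=loo)
X_fwd = forward.fit_transform(X, y)

# Step 2: backward selection on the reduced table, again wrapped in LOO CV
backward = SequentialFeatureSelector(svm, direction="backward",
                                     n_features_to_select=5, cv=loo)
X_final = backward.fit_transform(X_fwd, y)

# Step 3: estimate the SVM's accuracy on the final feature set with LOO CV
scores = cross_val_score(svm, X_final, y, cv=loo)
print("leave-one-out accuracy on selected features:", scores.mean())
```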
Thanks,
Roberto
Answers
Hi Roberto,
cross-validation (XValidation) is a good estimation scheme for your purpose. However, I would not recommend the 70:30 splitting scheme. The standard 90:10 split used in 10-fold cross-validation is preferable, i.e. it provides more accurate estimates. The leave-one-out scheme delivers even more accurate estimates and has the smallest bias of all the schemes mentioned, so it is preferable for small data sets. In RapidMiner, the XValidation operator has an option to activate the leave-one-out scheme instead of the standard cross-validation scheme.
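If you want to compare the two schemes outside RapidMiner, here is a minimal sketch, assuming a scikit-learn SVM and a small synthetic data set (illustrative only, not your data):

```python
# Minimal comparison of the 10-fold (90:10) scheme vs. leave-one-out,
# using synthetic data as a stand-in for the real example set.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, n_features=20, n_informative=5,
                           random_state=0)
svm = SVC(kernel="linear")

# Standard 10-fold cross-validation: each fold trains on 90% and tests on 10%
kfold_acc = cross_val_score(svm, X, y,
                            cv=KFold(n_splits=10, shuffle=True,
                                     random_state=0)).mean()

# Leave-one-out: one example held out per fold; lowest bias on small data sets
loo_acc = cross_val_score(svm, X, y, cv=LeaveOneOut()).mean()

print(f"10-fold accuracy:       {kfold_acc:.3f}")
print(f"leave-one-out accuracy: {loo_acc:.3f}")
```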
Best regards,
Ralf