"How to collect each performance of a backward elimination"

Question

I am using RapidMiner for my data mining research, and I used backward elimination for feature (attribute) selection. I was wondering how to set up the process so that it records the performance at each step of the backward elimination. For example: feature set one (A, B, C, D, E, F), performance one (…); feature set two (A, B, C, D, E), performance two (…); ….

I am currently processing a data table with 21 features and 157,000 items. A brute-force feature selection simply overloads my computer's memory. I was wondering how to find the best combination, and how to plot a graph that shows which combinations of features perform poorly and which perform well.
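Outside RapidMiner, the bookkeeping asked for here can be sketched in plain Python. This is a minimal sketch and not RapidMiner's operator: `backward_elimination`, `toy_score`, and the feature weights are hypothetical names and values chosen for illustration, and in practice `score` would be a cross-validated performance estimate. Unlike a stopping rule that halts at the first drop in performance, the sketch runs down to a single feature so that the whole path is logged.

```python
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[str, ...]


def backward_elimination(
    features: Sequence[str],
    score: Callable[[Subset], float],
) -> List[Tuple[Subset, float]]:
    """Greedy backward elimination that logs every kept feature set with its score."""
    current = tuple(features)
    history = [(current, score(current))]
    while len(current) > 1:
        # Try dropping each remaining feature once and score every reduced set.
        scored = []
        for drop in current:
            candidate = tuple(f for f in current if f != drop)
            scored.append((candidate, score(candidate)))
        # Keep the best reduced set (the first one wins on ties) and log it.
        best = max(scored, key=lambda item: item[1])
        history.append(best)
        current = best[0]
    return history


def toy_score(subset: Subset) -> float:
    """Hypothetical score: only A and C help, every feature costs a little."""
    useful = {"A": 0.4, "C": 0.3}
    return round(0.2 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset), 4)


history = backward_elimination(["A", "B", "C", "D", "E", "F"], toy_score)
for subset, perf in history:
    print(",".join(subset), perf)

# The logged path makes it easy to pick the best set or to plot score vs. set size.
best_subset, best_perf = max(history, key=lambda item: item[1])
```

For 21 features this greedy search scores 231 subsets rather than the 2^21 (about 2.1 million) that a brute-force search would need, and the `history` list can be passed to any plotting tool.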

Thanks in advance for your kind support. :)

land · Answer

Hi John,
well, this seems to be rather difficult without coding, but it should still be possible. You could build your own small XValidation using only operators. I will outline the steps here, but it's definitely beyond the scope of this free support forum to build it for you:
1. Generate a new attribute that distributes the examples over the folds.
2. Loop over each value of this attribute:
  2.1 Copy the data set and filter it according to the current value of the previously generated fold attribute: one set containing the matching examples, the other containing the non-matching ones.
  2.2 Learn the model on the non-matching set.
  2.3 Apply it to the matching set.
  2.4 Measure the performance and store it somewhere together with the fold number.
3. Average all performance measurements.

It can be achieved this way. Alternatively, you could ask for a quote for such an extension of the XValidation and thereby donate it to the general functionality :)
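The steps above can be sketched in plain Python to make the data flow concrete. This is only an illustration under stated assumptions, not a RapidMiner process: `manual_cross_validation`, `learn_threshold`, `predict_threshold`, and the toy data are hypothetical, and the one-dimensional threshold learner merely stands in for whatever model is actually being trained.

```python
def manual_cross_validation(examples, n_folds, learn, predict):
    """Hand-built cross-validation following steps 1 to 3."""
    # Step 1: a fold attribute that distributes the examples over the folds.
    fold_of = [i % n_folds for i in range(len(examples))]
    per_fold = {}
    # Step 2: loop over each value of the fold attribute.
    for k in range(n_folds):
        # 2.1: split into the matching (test) and non-matching (training) sets.
        test = [ex for ex, f in zip(examples, fold_of) if f == k]
        train = [ex for ex, f in zip(examples, fold_of) if f != k]
        # 2.2: learn the model on the non-matching set.
        model = learn(train)
        # 2.3: apply it to the matching set.
        hits = [predict(model, x) == y for x, y in test]
        # 2.4: store the accuracy together with the fold number.
        per_fold[k] = sum(hits) / len(hits)
    # Step 3: average all performance measurements.
    average = sum(per_fold.values()) / len(per_fold)
    return per_fold, average


def learn_threshold(train):
    """Toy learner: the threshold is the midpoint between the two class means."""
    xs = [x for x, y in train if y == "X"]
    ys = [x for x, y in train if y == "Y"]
    return (sum(xs) / len(xs) + sum(ys) / len(ys)) / 2


def predict_threshold(threshold, x):
    return "X" if x < threshold else "Y"


# Toy data: values 0..9 carry label X, values 10..19 carry label Y.
data = [(x, "X" if x < 10 else "Y") for x in range(20)]
per_fold, average = manual_cross_validation(data, 5, learn_threshold, predict_threshold)
print(per_fold, average)
```

Keeping the per-fold values in a dictionary keyed by fold number mirrors step 2.4, so individual folds can still be inspected after averaging.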

Greetings,
  Sebastian

JohnQuest · Answer

Dear Sebastian,
It worked, thanks a lot. May I ask another question: how do I set up automatic sampling with x-fold cross-validation?
For example, a data set contains label X (6000 items) and label Y (500 items). A 10-fold cross-validation splits the data into folds of 650 items each; we use 9 folds for training and 1 fold for testing. For each fold of the training set, we want to balance label X and label Y.
For example, fold 1 has label Y (50) and label X (600), so we sample 50 items out of label X in fold 1, giving a new sampled fold 1 with label Y (50) and label X (50), and we do the same for the other 8 training folds. Then we use the 9 sampled folds for training and the 1 unbalanced fold for testing; the experiment rotates the training and testing sets through all 10 folds and collects the final performance.
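The sampling scheme described here can be sketched in plain Python to pin down what is balanced and what is not. This is a hedged illustration rather than a RapidMiner process: `undersample` and `balanced_training_splits` are hypothetical names, and the stratified fold assignment (each fold receiving 600 X and 50 Y) is an assumption taken from the example numbers.

```python
import random
from collections import Counter, defaultdict


def undersample(fold, rng):
    """Randomly drop rows so every label is as frequent as the rarest one."""
    by_label = defaultdict(list)
    for row in fold:
        by_label[row[1]].append(row)
    n = min(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, n))
    return balanced


def balanced_training_splits(rows, n_folds, seed=0):
    """Yield (balanced training set, untouched test fold) for each fold."""
    rng = random.Random(seed)
    # Assumption: stratified assignment, so each fold keeps the label ratio.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    folds = [[] for _ in range(n_folds)]
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        for i, row in enumerate(label_rows):
            folds[i % n_folds].append(row)
    for k in range(n_folds):
        train = []
        for j, fold in enumerate(folds):
            if j != k:
                # Balance each training fold separately, as in the example.
                train.extend(undersample(fold, rng))
        # The test fold is returned unbalanced.
        yield train, folds[k]


# Toy rows of (id, label): 6000 items of label X and 500 items of label Y.
rows = [(i, "X") for i in range(6000)] + [(i, "Y") for i in range(6000, 6500)]
splits = list(balanced_training_splits(rows, 10))
train, test = splits[0]
print(Counter(label for _, label in train), Counter(label for _, label in test))
```

Sampling only inside the training folds keeps the test fold at the original class ratio, so the collected performance still reflects the unbalanced data.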
Thanks for your kind support.

Best Regards

John Quest

land · Answer

Hi John,
indeed, we don't have the same version. The latest version is 5.0.006, and I would suggest updating if possible. But this isn't the reason for the missing "Process" operator: this one cannot be added by users, since it represents the complete process and is added automatically when a new process is created.
If you want to use my posted process, copy it from here and paste it into the XML View of RapidMiner. After you press the apply button, the process will be reconstructed from this XML fragment.

Greetings,
  Sebastian