RM 9.1 feedback: Let's talk about the new Automatic Feature Engineering (FS) - Part 2

Question

Hi,

This topic of feature selection definitely inspires me:

1/ Optimize Selection (Evolutionary) operator vs AFE operator: If I understand correctly, the AFE operator uses an evolutionary algorithm, so a priori we should get the same results with both operators. That is not the case. For example, here are the results with the Titanic dataset and a decision tree (DT) model:
- with OS (Evol.) ==> acc = 81.20% / feature set = 5 features
- with AFE (with "balance for accuracy" = 1) ==> acc = 79.07% / feature set = 1 feature
Why did AFE not arrive at the same feature set and, ultimately, the same performance?

2/ Unexpected results with the "balance for accuracy" parameter of the AFE operator: Still with the Titanic dataset / DT model: when we set "balance for accuracy" = 0 (so we expect the simplest feature set), we obtain... the original dataset! And when we set "balance for accuracy" = 1, we obtain:
Why is this last feature set not obtained with "balance for accuracy" = 0? From my point of view, the resulting feature sets are not consistent with the value of the "balance for accuracy" parameter.

3/ The tutorial associated with the AFE operator is broken: there are missing links between some operators.

4/ Performance output port of AFE: There is a performance output port inside the AFE operator, but there is no performance output port outside the operator. Is there a reason for that? Maybe, in practice, the AFE needs to be cross-validated itself?

In conclusion, can you provide some clarification on all these items?

Thank you for listening,
Regards,
Lionel

NB: The process:

IngoRM · Accepted Answer

Hi @lionelderkrikor, OK, now on to part 2 of the comments. Thanks again, BTW.
1) "Optimize Selection (Evolutionary) operator vs AFE operator - If I understand correctly, the AFE operator uses an evolutionary algorithm, so a priori we should get the same results with both"
No, they are actually not the same. The new operator uses the same basic concepts but different techniques for selection, mutation, and generation. It also uses improved heuristics for the stopping criteria and adds multistarts, which should lead to better results faster in most cases. I say "most cases" because these are still randomized heuristics, so there are no guarantees. But it worked very well on the 20+ test data sets we have been analyzing and comparing: it never performed statistically significantly worse, and sometimes performed significantly better.
In addition, there seems to be a bug in the final model selection which does not always occur but does in your test case (see below and also the other thread on the "shift" issue).
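To make the ingredients named in point 1 concrete (selection, mutation, and multistarts), here is a minimal Python sketch of a generic evolutionary feature selection. It is not the implementation of either RapidMiner operator: the tournament selection, bit-flip mutation, elitism, the `toy_fitness` function, and all parameter names are assumptions chosen for illustration only.

```python
import random

def evolve_feature_subset(n_features, fitness, generations=30, pop_size=20,
                          mutation_rate=0.1, rng=None):
    """One evolutionary run. Returns (best_mask, best_fitness)."""
    if rng is None:
        rng = random.Random()
    # Random initial population of boolean masks (True = feature is kept).
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: the best mask always survives
        while len(next_gen) < pop_size:
            # Tournament selection between two random individuals.
            a, b = rng.sample(population, 2)
            parent = a if fitness(a) >= fitness(b) else b
            # Bit-flip mutation: toggle each feature with a small probability.
            child = [(not bit) if rng.random() < mutation_rate else bit
                     for bit in parent]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
    return best, fitness(best)

def multistart(n_features, fitness, starts=5, seed=0, **kwargs):
    """Run several independent searches and keep the best result."""
    runs = [evolve_feature_subset(n_features, fitness,
                                  rng=random.Random(seed + i), **kwargs)
            for i in range(starts)]
    return max(runs, key=lambda run: run[1])

# Hypothetical fitness: features 0 and 2 are useful, every other kept
# feature costs a small penalty. A real setup would train and score a model.
USEFUL = {0, 2}

def toy_fitness(mask):
    hits = sum(1 for i, keep in enumerate(mask) if keep and i in USEFUL)
    extras = sum(1 for i, keep in enumerate(mask) if keep and i not in USEFUL)
    return hits - 0.1 * extras

if __name__ == "__main__":
    mask, score = multistart(6, toy_fitness, starts=3, seed=0)
    print(mask, score)
```

Two searches of this kind that differ in their selection scheme, mutation operator, stopping rule, or random seed can end on different feature sets, which is why identical results from two different evolutionary operators should not be expected.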
2) "Unexpected results with the "balance for accuracy" parameter of the AFE operator"
I am 99% sure that this is the result of the "shifting" bug, which sometimes occurs during model selection. You can see the same problem in the visualization of the Pareto front in AM, as you have pointed out before.
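As an illustration of the behavior the question expects from this parameter (0 gives the simplest feature set, 1 the most accurate one), here is a small Python sketch that builds a Pareto front and picks one point from it. The weighted-sum rule in `pick_by_balance` is an assumption made for this sketch, not the operator's actual selection logic, and the candidate numbers are made up rather than taken from the Titanic run.

```python
def pareto_front(candidates):
    """Keep (n_features, error) pairs not dominated by any other pair.

    Lower is better on both axes. A pair is dominated if another pair is
    at least as good on both axes and strictly better on at least one.
    """
    front = [
        c for c in candidates
        if not any(
            o[0] <= c[0] and o[1] <= c[1] and (o[0] < c[0] or o[1] < c[1])
            for o in candidates
        )
    ]
    return sorted(set(front))  # ordered from simplest to most complex

def pick_by_balance(front, balance):
    """Hypothetical rule: 0 favors few features, 1 favors low error."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be between 0 and 1")
    sizes = [c[0] for c in front]
    errors = [c[1] for c in front]

    def norm(value, low, high):
        # Scale to [0, 1]; a flat axis contributes nothing.
        return 0.0 if high == low else (value - low) / (high - low)

    def score(c):
        return (balance * norm(c[1], min(errors), max(errors))
                + (1.0 - balance) * norm(c[0], min(sizes), max(sizes)))

    return min(front, key=score)

if __name__ == "__main__":
    # Made-up (number of features, classification error) candidates.
    candidates = [(1, 0.21), (2, 0.20), (3, 0.22), (5, 0.19), (7, 0.19)]
    front = pareto_front(candidates)
    print(front)                        # non-dominated trade-offs
    print(pick_by_balance(front, 0.0))  # simplest point on the front
    print(pick_by_balance(front, 1.0))  # most accurate point on the front
```

Under a rule like this, a balance of 0 can never return a larger feature set than a balance of 1, so results that go the other way point to a problem in the selection step rather than in the search itself.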

3) "The tutorial associated to the AFE operator is broken : there are missing links between some operators..."
Yes, thanks.  This has already been fixed in the recent development build and will be part of the next release.
4) "Is there any reason to that ? maybe, in practice, the AFE need to be itself cross-validated?"
Exactly. Well, not necessarily cross-validated, but at least validated on a test set. The inner performance is the "training error" of the feature engineering. As you know, I am a strong believer that looking at training errors is a sure recipe for disaster, which is why we do not deliver it outside the operator, to avoid problems with it in the first place. If you absolutely want to see it, you can use the third port, which delivers all the logged results, or use the logging mechanism of RapidMiner. So we do not hide it, we just make it a bit harder to misuse ;-)
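The reason the inner performance is a training error can be shown with a toy Python sketch. Everything in it is an assumption for illustration: the data is pure random noise, and "feature engineering" is reduced to picking the single binary feature that agrees best with the label on the training rows.

```python
import random

def accuracy(rows, feature):
    """Share of rows where the chosen binary feature equals the label."""
    return sum(1 for features, label in rows
               if features[feature] == label) / len(rows)

def select_best_feature(train_rows, n_features):
    """Stand-in for feature engineering: best-looking feature on training data."""
    return max(range(n_features), key=lambda f: accuracy(train_rows, f))

def make_noise_rows(n_rows, n_features, rng):
    """Random binary features and random binary labels (no real signal)."""
    return [([rng.randint(0, 1) for _ in range(n_features)], rng.randint(0, 1))
            for _ in range(n_rows)]

rng = random.Random(42)
N_FEATURES = 50
train_rows = make_noise_rows(40, N_FEATURES, rng)
test_rows = make_noise_rows(400, N_FEATURES, rng)

chosen = select_best_feature(train_rows, N_FEATURES)
inner_score = accuracy(train_rows, chosen)  # what the inner port would report
outer_score = accuracy(test_rows, chosen)   # honest estimate on unseen rows

if __name__ == "__main__":
    print("inner (training) accuracy:", inner_score)
    print("outer (held-out) accuracy:", outer_score)
```

Because the selection step picks the maximum over many candidates on the training rows, the inner score is biased upward by construction, even though no feature carries any signal here. Only the score on rows that the selection never saw is a fair estimate, which is the point of validating the whole operator on a test set.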

Hope this helps, and we will certainly look into the shifting bug (point 2 above) ASAP. Thanks,
Ingo

IngoRM · Accepted Answer

BTW, here is a somewhat simplified process based on yours that uses classification error instead of accuracy. However, without the shifting bug fix, this can still lead to weird behavior in certain situations.