RM 9.1 feedback : Let's talk of the new Automatic Feature Engineering (FS) - Part 2
lionelderkrikor
New Altair Community Member
Hi,
This topic of feature selection definitely inspires me:
1/ Optimize Selection (Evolutionary) operator vs AFE operator:
If I understand correctly, the AFE operator uses an evolutionary algorithm, so we should, a priori, find the same results with the two operators.
This is not the case. For example, here are the results with the Titanic dataset and a DT model:
- with OS (Evol) ==> acc = 81.20 % / feature set = 5 features
- with AFE (with "balance for accuracy" = 1) ==> acc = 79.07 % / feature set = 1 feature
Why did AFE not arrive at the same feature set and, in the end, the same performance?
2/ Unexpected results with the "balance for accuracy" parameter of the AFE operator:
Still with the Titanic dataset / DT model:
When we set "balance for accuracy" = 0 (so we expect the simplest feature set), we obtain the... original dataset!
And when we set "balance for accuracy" = 1, we obtain a different feature set.
Why is this last feature set not obtained with "balance for accuracy" = 0? From my point of view, the resulting feature sets are not consistent with the value of the "balance for accuracy" parameter...
3/ The tutorial associated with the AFE operator is broken: there are missing links between some operators...
4/ Performance output port of AFE:
There is a performance output port inside the AFE operator, but there is no performance output port outside the operator.
Is there any reason for that? Maybe, in practice, the AFE needs to be itself cross-validated?
In conclusion, can you provide some clarifications on all these items?
Thank you for listening,
Regards,
Lionel
NB: The process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="85"> <parameter key="repository_entry" value="//Samples/data/Titanic"/> </operator> <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="85"> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="Ticket Number|Name|Cabin"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="85"> <parameter key="attribute_name" value="Survived"/> <parameter key="target_role" value="label"/> <list key="set_additional_roles"/> </operator> <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="103" name="Multiply" width="90" x="514" y="85"/> 
<operator activated="true" class="optimize_selection_evolutionary" compatibility="9.1.000" expanded="true" height="103" name="Optimize Selection (Evolutionary)" width="90" x="648" y="85"> <parameter key="use_exact_number_of_attributes" value="false"/> <parameter key="restrict_maximum" value="false"/> <parameter key="min_number_of_attributes" value="1"/> <parameter key="max_number_of_attributes" value="1"/> <parameter key="exact_number_of_attributes" value="1"/> <parameter key="initialize_with_input_weights" value="false"/> <parameter key="population_size" value="5"/> <parameter key="maximum_number_of_generations" value="30"/> <parameter key="use_early_stopping" value="false"/> <parameter key="generations_without_improval" value="2"/> <parameter key="normalize_weights" value="true"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="user_result_individual_selection" value="false"/> <parameter key="show_population_plotter" value="false"/> <parameter key="plot_generations" value="10"/> <parameter key="constraint_draw_range" value="false"/> <parameter key="draw_dominated_points" value="true"/> <parameter key="maximal_fitness" value="Infinity"/> <parameter key="selection_scheme" value="tournament"/> <parameter key="tournament_size" value="0.25"/> <parameter key="start_temperature" value="1.0"/> <parameter key="dynamic_selection_pressure" value="true"/> <parameter key="keep_best_individual" value="false"/> <parameter key="save_intermediate_weights" value="false"/> <parameter key="intermediate_weights_generations" value="10"/> <parameter key="p_initialize" value="0.5"/> <parameter key="p_mutation" value="-1.0"/> <parameter key="p_crossover" value="0.5"/> <parameter key="crossover_type" value="uniform"/> <process expanded="true"> <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" width="90" x="313" y="34"> 
<parameter key="split_on_batch_attribute" value="false"/> <parameter key="leave_one_out" value="false"/> <parameter key="number_of_folds" value="10"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="85"> <parameter key="criterion" value="gain_ratio"/> <parameter key="maximal_depth" value="10"/> <parameter key="apply_pruning" value="true"/> <parameter key="confidence" value="0.1"/> <parameter key="apply_prepruning" value="true"/> <parameter key="minimal_gain" value="0.01"/> <parameter key="minimal_leaf_size" value="2"/> <parameter key="minimal_size_for_split" value="4"/> <parameter key="number_of_prepruning_alternatives" value="3"/> </operator> <connect from_port="training set" to_op="Decision Tree" to_port="training set"/> <connect from_op="Decision Tree" from_port="model" to_port="model"/> <portSpacing port="source_training set" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34"> <parameter key="main_criterion" value="first"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="true"/> <parameter key="kappa" value="false"/> <parameter key="weighted_mean_recall" value="false"/> <parameter 
key="weighted_mean_precision" value="false"/> <parameter key="spearman_rho" value="false"/> <parameter key="kendall_tau" value="false"/> <parameter key="absolute_error" value="false"/> <parameter key="relative_error" value="false"/> <parameter key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="false"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_mean_squared_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="false"/> <parameter key="correlation" value="false"/> <parameter key="squared_correlation" value="false"/> <parameter key="cross-entropy" value="false"/> <parameter key="margin" value="false"/> <parameter key="soft_margin_loss" value="false"/> <parameter key="logistic_loss" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Apply Model" to_port="model"/> <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/> <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/> <connect from_op="Performance" from_port="performance" to_port="performance 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_test set results" spacing="0"/> <portSpacing port="sink_performance 1" spacing="0"/> <portSpacing port="sink_performance 2" spacing="0"/> </process> </operator> <connect from_port="example set" to_op="Cross Validation" to_port="example set"/> <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> 
</operator> <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.1.000" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="648" y="289"> <parameter key="mode" value="feature selection"/> <parameter key="balance for accuracy" value="1.0"/> <parameter key="show progress dialog" value="false"/> <parameter key="use_local_random_seed" value="false"/> <parameter key="local_random_seed" value="1992"/> <parameter key="use optimization heuristics" value="true"/> <parameter key="maximum generations" value="30"/> <parameter key="population size" value="10"/> <parameter key="use multi-starts" value="true"/> <parameter key="number of multi-starts" value="5"/> <parameter key="generations until multi-start" value="10"/> <parameter key="use time limit" value="false"/> <parameter key="time limit in seconds" value="60"/> <parameter key="use subset for generation" value="false"/> <parameter key="maximum function complexity" value="10"/> <parameter key="use_plus" value="false"/> <parameter key="use_diff" value="false"/> <parameter key="use_mult" value="true"/> <parameter key="use_div" value="true"/> <parameter key="reciprocal_value" value="true"/> <parameter key="use_square_roots" value="false"/> <parameter key="use_exp" value="false"/> <parameter key="use_log" value="false"/> <parameter key="use_absolute_values" value="false"/> <parameter key="use_sgn" value="false"/> <parameter key="use_min" value="false"/> <parameter key="use_max" value="false"/> <process expanded="true"> <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="313" y="85"> <parameter key="split_on_batch_attribute" value="false"/> <parameter key="leave_one_out" value="false"/> <parameter key="number_of_folds" value="10"/> <parameter key="sampling_type" value="automatic"/> <parameter key="use_local_random_seed" value="false"/> <parameter 
key="local_random_seed" value="1992"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="85"> <parameter key="criterion" value="gain_ratio"/> <parameter key="maximal_depth" value="10"/> <parameter key="apply_pruning" value="true"/> <parameter key="confidence" value="0.1"/> <parameter key="apply_prepruning" value="true"/> <parameter key="minimal_gain" value="0.01"/> <parameter key="minimal_leaf_size" value="2"/> <parameter key="minimal_size_for_split" value="4"/> <parameter key="number_of_prepruning_alternatives" value="3"/> </operator> <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/> <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/> <portSpacing port="source_training set" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true"> <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34"> <list key="application_parameters"/> <parameter key="create_view" value="false"/> </operator> <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34"> <parameter key="main_criterion" value="first"/> <parameter key="accuracy" value="true"/> <parameter key="classification_error" value="true"/> <parameter key="kappa" value="false"/> <parameter key="weighted_mean_recall" value="false"/> <parameter key="weighted_mean_precision" value="false"/> <parameter key="spearman_rho" value="false"/> <parameter key="kendall_tau" value="false"/> <parameter key="absolute_error" value="false"/> <parameter key="relative_error" value="false"/> <parameter 
key="relative_error_lenient" value="false"/> <parameter key="relative_error_strict" value="false"/> <parameter key="normalized_absolute_error" value="false"/> <parameter key="root_mean_squared_error" value="false"/> <parameter key="root_relative_squared_error" value="false"/> <parameter key="squared_error" value="false"/> <parameter key="correlation" value="false"/> <parameter key="squared_correlation" value="false"/> <parameter key="cross-entropy" value="false"/> <parameter key="margin" value="false"/> <parameter key="soft_margin_loss" value="false"/> <parameter key="logistic_loss" value="false"/> <parameter key="skip_undefined_labels" value="true"/> <parameter key="use_example_weights" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/> <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_test set results" spacing="0"/> <portSpacing port="sink_performance 1" spacing="0"/> <portSpacing port="sink_performance 2" spacing="0"/> </process> </operator> <operator activated="true" class="remember" compatibility="9.1.000" expanded="true" height="68" name="Remember" width="90" x="447" y="136"> <parameter key="name" value="performance"/> <parameter key="io_object" value="PerformanceVector"/> <parameter key="store_which" value="1"/> <parameter key="remove_from_process" value="true"/> </operator> <connect from_port="example set source" to_op="Cross Validation (2)" to_port="example set"/> <connect from_op="Cross Validation (2)" from_port="performance 1" to_op="Remember" to_port="store"/> <connect from_op="Remember" 
from_port="stored" to_port="performance sink"/> <portSpacing port="source_example set source" spacing="0"/> <portSpacing port="sink_performance sink" spacing="0"/> </process> </operator> <operator activated="true" class="recall" compatibility="9.1.000" expanded="true" height="68" name="Recall" width="90" x="849" y="340"> <parameter key="name" value="performance"/> <parameter key="io_object" value="PerformanceVector"/> <parameter key="remove_from_store" value="true"/> </operator> <connect from_op="Retrieve Titanic" from_port="output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/> <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="Automatic Feature Engineering" to_port="example set in"/> <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_port="result 2"/> <connect from_op="Optimize Selection (Evolutionary)" from_port="performance" to_port="result 1"/> <connect from_op="Automatic Feature Engineering" from_port="feature set" to_port="result 3"/> <connect from_op="Recall" from_port="result" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
Best Answers
Hi @lionelderkrikor,
Ok, now to part 2 of the comments. Thanks again BTW.
1) "Optimize Selection (Evolutionary) operator vs AFE operator - If I understand correctly, the AFE operator is using an evolutionary algorithm, so we must, a priori, find the same results with the 2"
No, they are actually not the same. The new operator uses the same basic concepts but different techniques for selection, mutation, and generation. It also uses some improved heuristics for the stopping criteria and adds multi-starts, which should lead to better results faster in most cases. "Most cases" because those are still randomized heuristics, so there are no guarantees, but it worked very well on the 20+ test data sets we have been analyzing and comparing, and it never showed statistically significantly worse performance (but sometimes performed significantly better).
In addition, there seems to be a bug in the final model selection which does not always occur but does in your test case (see below and also the other thread on the "shift" issue).
2) "Unexpected results with the "balance for accuracy" parameter of the AFE operator"
I am 99% sure that this is the result of the "shifting" bug which sometimes occurs during the model selection. You can see the same problem in the visualization of the Pareto front in AM, as you have pointed out before.
3) "The tutorial associated to the AFE operator is broken : there are missing links between some operators..."
Yes, thanks. This has already been fixed in the recent development build and will be part of the next release.
4) "Is there any reason to that ? maybe, in practice, the AFE need to be itself cross-validated?"
Exactly. Well, not necessarily cross-validated, but at least validated on a test set. The inner performance is the "training error" of the feature engineering. As you know, I am a strong believer that looking at training errors is a sure recipe for disaster, which is why we do not deliver it outside here, to avoid problems with it in the first place. If you absolutely want to see it, you can use the third port, which delivers all the logged results, or use the logging mechanism of RapidMiner. So we do not hide it, we just make it a bit harder to misuse it ;-)
Hope this helps, and we will certainly have a look into the shifting bug (point 2 above) asap.
Thanks,
Ingo
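To make point 4 a bit more concrete, here is a toy sketch in plain Python (not a RapidMiner process). Everything in it is made up for illustration: the data is pure noise, and the sizes and helper names are hypothetical. It picks the "best" of many useless features on one data set and then scores that same feature on fresh data, which is roughly why the inner performance of a feature search should not be reported as the final estimate.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(n_rows, n_features):
    """Binary noise features plus a binary label that is independent of them."""
    rows = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [random.randint(0, 1) for _ in range(n_rows)]
    return rows, labels

def accuracy(rows, labels, feature):
    """Accuracy of the trivial model 'predict the label as the value of one feature'."""
    hits = sum(1 for row, y in zip(rows, labels) if row[feature] == y)
    return hits / len(labels)

N_FEATURES = 200  # hypothetical number of candidate features

# Data used to search for the best feature, and fresh data never seen by the search.
select_rows, select_labels = make_data(100, N_FEATURES)
fresh_rows, fresh_labels = make_data(2000, N_FEATURES)

# "Feature selection": keep the single feature that looks best on the selection data.
best = max(range(N_FEATURES), key=lambda f: accuracy(select_rows, select_labels, f))

inner_accuracy = accuracy(select_rows, select_labels, best)   # optimistic
holdout_accuracy = accuracy(fresh_rows, fresh_labels, best)   # close to chance

print(f"inner accuracy:   {inner_accuracy:.3f}")
print(f"holdout accuracy: {holdout_accuracy:.3f}")
```

Since no feature carries any signal here, the gap between the two numbers comes entirely from the search itself. In a real process the equivalent would be to wrap the whole feature engineering step in an outer validation.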
BTW, here is a somewhat simplified process based on yours which uses classification error instead of accuracy. However, without the shifting bug fix this can still lead to weird behaviors in certain situations.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="Ticket Number|Name|Cabin"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="Survived"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.1.001-SNAPSHOT" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="447" y="34">
        <parameter key="mode" value="feature selection"/>
        <parameter key="balance for accuracy" value="1.0"/>
        <parameter key="show progress dialog" value="false"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use optimization heuristics" value="true"/>
        <parameter key="maximum generations" value="30"/>
        <parameter key="population size" value="10"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="5"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="false"/>
        <parameter key="time limit in seconds" value="60"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="10"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="true"/>
        <parameter key="use_square_roots" value="false"/>
        <parameter key="use_exp" value="false"/>
        <parameter key="use_log" value="false"/>
        <parameter key="use_absolute_values" value="false"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="45" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="85">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
              <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="false"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set source" to_op="Cross Validation (2)" to_port="example set"/>
          <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Automatic Feature Engineering" to_port="example set in"/>
      <connect from_op="Automatic Feature Engineering" from_port="feature set" to_port="result 1"/>
      <connect from_op="Automatic Feature Engineering" from_port="population" to_port="result 2"/>
      <connect from_op="Automatic Feature Engineering" from_port="optimization log" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Answers
-
Hi @lionelderkrikor,

OK, now to part 2 of the comments. Thanks again, BTW.

1) "Optimize Selection (Evolutionary) operator vs AFE operator - If I good understand, AFE operator is using an evolutionnary algorithm, so we must, a priori, find the same results with the 2"

No, they are actually not the same. The new operator uses the same basic concepts but different techniques for selection, mutation, and generation. It also uses improved heuristics for the stopping criteria and adds multi-starts, which should lead to better results faster in most cases. "Most cases" because these are still randomized heuristics, so there are no guarantees. That said, it worked very well on the 20+ test data sets we analyzed and compared: it never showed statistically significantly poorer performance, and it sometimes performed significantly better.

In addition, there seems to be a bug in the final model selection which does not always occur but does in your test case (see below and also the other thread on the "shift" issue).

2) "Unexpected results with the "balance for accuracy" parameter of the AFE operator"

I am 99% sure that this is the result of the "shifting" bug which sometimes occurs during the model selection. You can see the same problem in the visualization of the Pareto front in AM, as you have pointed out before.

3) "The tutorial associated to the AFE operator is broken : there are missing links between some operators..."

Yes, thanks. This has already been fixed in the recent development build and will be part of the next release.

4) "Is there any reason to that ? maybe, in practice, the AFE need to be itself cross-validated?"

Exactly. Well, not necessarily cross-validated, but at least validated on a test set. The inner performance is the "training error" of the feature engineering. As you know, I am a strong believer that looking at training errors is a sure recipe for disaster, which is why we do not deliver it outside the operator, to avoid problems with it in the first place. If you absolutely want to see it, you can use the third port, which delivers all the logged results, or use the logging mechanism of RapidMiner. So we do not hide it, we just make it a bit harder to misuse it ;-)

Hope this helps, and we will certainly have a look into the shifting bug (point 2 above) asap.

Thanks,
Ingo
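To illustrate point 4, here is a minimal sketch in plain Python (not RapidMiner, and not the AFE operator's actual algorithm) of why the inner performance of a feature search is an optimistic "training error". The data is pure noise by construction, each binary feature is used directly as the prediction in place of a real model, and the constants and helper names are made up for this example.

```python
import random

# Assumption: pure-noise data, so no feature can truly predict the label.
rng = random.Random(2001)
N_TRAIN, N_TEST, N_FEATURES = 60, 1000, 200
N_ROWS = N_TRAIN + N_TEST

labels = [rng.randint(0, 1) for _ in range(N_ROWS)]
features = [[rng.randint(0, 1) for _ in range(N_ROWS)] for _ in range(N_FEATURES)]


def accuracy(feature, target):
    """Fraction of rows where the binary feature equals the binary label."""
    return sum(f == y for f, y in zip(feature, target)) / len(target)


# "Feature selection": keep the single feature that scores best on the
# training rows. This score plays the role of the inner performance.
train_scores = [accuracy(f[:N_TRAIN], labels[:N_TRAIN]) for f in features]
best = max(range(N_FEATURES), key=train_scores.__getitem__)
inner_accuracy = train_scores[best]

# Honest estimate: score the selected feature on rows the search never saw.
holdout_accuracy = accuracy(features[best][N_TRAIN:], labels[N_TRAIN:])

print(f"inner accuracy of selected feature: {inner_accuracy:.3f}")
print(f"hold-out accuracy of same feature:  {holdout_accuracy:.3f}")
```

The selected feature looks clearly better than chance on the rows used for the search, yet stays close to 50% on the held-out rows. That gap is the reason to validate the result of a feature search on data the search did not use.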
Hi,

OK, we have looked into this again. It turned out that these were in fact two different issues after all.

One was a problem with the ordering of the individuals in the Pareto front, which in certain circumstances could lead to a shifted selection of individuals (most notably visible in the results of AM, which is why I will comment on that in the other thread in a minute: https://community.rapidminer.com/discussion/54284/rm-9-1-feedback-lets-talk-of-the-new-automatic-feature-engineering-fs#latest)

The other issue is the problem with the "wrong" selection based on the bias. The reason for that is quite simple: you have used "accuracy" as the main criterion in your process, but the AFE operator requires the inner performance to deliver an error rate, i.e. something which is minimized, not maximized. Although this was stated in the documentation of the operator, it was definitely a bit hidden, and we have improved the documentation on that.

Both issues together led to the behavior you have observed.

Thanks again for pointing these things out. The shifting bug fix and the updated documentation will both be part of the next release (beta starts soon already).

Best,
Ingo
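As a rough illustration of the second issue, the sketch below (plain Python, not the AFE operator's actual selection logic) shows what goes wrong when a criterion that should be maximized is handed to a routine that minimizes. The two candidates reuse the accuracy figures quoted in the question purely as illustrative numbers; the function names are hypothetical.

```python
def classification_error(accuracy):
    """Classification error is the complement of accuracy."""
    return 1.0 - accuracy


def pick_by_minimizing(scores):
    """Return the candidate with the smallest score, as a minimizer would."""
    return min(scores, key=scores.get)


# Illustrative candidates: feature set name -> accuracy (as a fraction).
candidates = {"5 features": 0.8120, "1 feature": 0.7907}

# Feeding accuracy to a minimizer rewards the WORST candidate.
picked_with_accuracy = pick_by_minimizing(candidates)

# Feeding classification error to the same minimizer rewards the best one.
errors = {name: classification_error(acc) for name, acc in candidates.items()}
picked_with_error = pick_by_minimizing(errors)

print("minimizing accuracy picks:", picked_with_accuracy)
print("minimizing error picks:   ", picked_with_error)
```

With accuracy as the score, the minimizer prefers the less accurate feature set; with classification error, it prefers the more accurate one. This is why the inner Performance operator should deliver classification error rather than accuracy as its main criterion.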
BTW, here is a somewhat simplified process based on yours which uses classification error instead of accuracy. However, without the shifting bug fix this can still lead to weird behaviors in certain situations.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="Ticket Number|Name|Cabin"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="Survived"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.1.001-SNAPSHOT" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="447" y="34">
        <parameter key="mode" value="feature selection"/>
        <parameter key="balance for accuracy" value="1.0"/>
        <parameter key="show progress dialog" value="false"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use optimization heuristics" value="true"/>
        <parameter key="maximum generations" value="30"/>
        <parameter key="population size" value="10"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="5"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="false"/>
        <parameter key="time limit in seconds" value="60"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="10"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="true"/>
        <parameter key="use_square_roots" value="false"/>
        <parameter key="use_exp" value="false"/>
        <parameter key="use_log" value="false"/>
        <parameter key="use_absolute_values" value="false"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="45" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="85">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
              <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="false"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set source" to_op="Cross Validation (2)" to_port="example set"/>
          <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Automatic Feature Engineering" to_port="example set in"/>
      <connect from_op="Automatic Feature Engineering" from_port="feature set" to_port="result 1"/>
      <connect from_op="Automatic Feature Engineering" from_port="population" to_port="result 2"/>
      <connect from_op="Automatic Feature Engineering" from_port="optimization log" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>