RM 9.1 feedback: Let's talk about the new Automatic Feature Engineering (FS) - Part 2

lionelderkrikor New Altair Community Member
edited November 5 in Community Q&A
Hi,

This topic of feature selection definitely inspires me:

1/ Optimize Selection (Evolutionary) operator vs. AFE operator:

If I understand correctly, the AFE operator uses an evolutionary algorithm, so a priori we should get the same results with both operators.
That is not the case. For example, here are the results with the Titanic dataset and a DT model:
 - with OS (Evol) ==> acc = 81.20% / feature set = 5 features
 - with AFE (with "balance for accuracy" = 1) ==> acc = 79.07% / feature set = 1 feature
Why did AFE not arrive at the same feature set and, ultimately, the same performance?

2/ Unexpected results with the "balance for accuracy" parameter of the AFE operator:
Again with the Titanic dataset / DT model:
When we set "balance for accuracy" = 0 (so we expect the simplest feature set), we obtain the... original dataset! (screenshot not preserved)

And when we set "balance for accuracy" = 1, we obtain: (screenshot not preserved)

Why is this last feature set not obtained with "balance for accuracy" = 0? From my point of view, the resulting feature sets are not consistent with the value of the "balance for accuracy" parameter...

3/ The tutorial associated with the AFE operator is broken: some connections between operators are missing...

4/ Performance output port of AFE:
There is a performance output port inside the AFE operator (screenshot not preserved),
but there is no performance output port outside the operator (screenshot not preserved).

Is there a reason for that? Perhaps, in practice, AFE itself needs to be cross-validated?

In conclusion, could you clarify these points?

Thank you for your attention,

Regards,

Lionel

NB: The process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve Titanic" width="90" x="112" y="85">
        <parameter key="repository_entry" value="//Samples/data/Titanic"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="Ticket Number|Name|Cabin"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.1.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="85">
        <parameter key="attribute_name" value="Survived"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="103" name="Multiply" width="90" x="514" y="85"/>
      <operator activated="true" class="optimize_selection_evolutionary" compatibility="9.1.000" expanded="true" height="103" name="Optimize Selection (Evolutionary)" width="90" x="648" y="85">
        <parameter key="use_exact_number_of_attributes" value="false"/>
        <parameter key="restrict_maximum" value="false"/>
        <parameter key="min_number_of_attributes" value="1"/>
        <parameter key="max_number_of_attributes" value="1"/>
        <parameter key="exact_number_of_attributes" value="1"/>
        <parameter key="initialize_with_input_weights" value="false"/>
        <parameter key="population_size" value="5"/>
        <parameter key="maximum_number_of_generations" value="30"/>
        <parameter key="use_early_stopping" value="false"/>
        <parameter key="generations_without_improval" value="2"/>
        <parameter key="normalize_weights" value="true"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="user_result_individual_selection" value="false"/>
        <parameter key="show_population_plotter" value="false"/>
        <parameter key="plot_generations" value="10"/>
        <parameter key="constraint_draw_range" value="false"/>
        <parameter key="draw_dominated_points" value="true"/>
        <parameter key="maximal_fitness" value="Infinity"/>
        <parameter key="selection_scheme" value="tournament"/>
        <parameter key="tournament_size" value="0.25"/>
        <parameter key="start_temperature" value="1.0"/>
        <parameter key="dynamic_selection_pressure" value="true"/>
        <parameter key="keep_best_individual" value="false"/>
        <parameter key="save_intermediate_weights" value="false"/>
        <parameter key="intermediate_weights_generations" value="10"/>
        <parameter key="p_initialize" value="0.5"/>
        <parameter key="p_mutation" value="-1.0"/>
        <parameter key="p_crossover" value="0.5"/>
        <parameter key="crossover_type" value="uniform"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation" width="90" x="313" y="34">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree" width="90" x="179" y="85">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="model_simulator:automatic_feature_engineering" compatibility="9.1.000" expanded="true" height="103" name="Automatic Feature Engineering" width="90" x="648" y="289">
        <parameter key="mode" value="feature selection"/>
        <parameter key="balance for accuracy" value="1.0"/>
        <parameter key="show progress dialog" value="false"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <parameter key="use optimization heuristics" value="true"/>
        <parameter key="maximum generations" value="30"/>
        <parameter key="population size" value="10"/>
        <parameter key="use multi-starts" value="true"/>
        <parameter key="number of multi-starts" value="5"/>
        <parameter key="generations until multi-start" value="10"/>
        <parameter key="use time limit" value="false"/>
        <parameter key="time limit in seconds" value="60"/>
        <parameter key="use subset for generation" value="false"/>
        <parameter key="maximum function complexity" value="10"/>
        <parameter key="use_plus" value="false"/>
        <parameter key="use_diff" value="false"/>
        <parameter key="use_mult" value="true"/>
        <parameter key="use_div" value="true"/>
        <parameter key="reciprocal_value" value="true"/>
        <parameter key="use_square_roots" value="false"/>
        <parameter key="use_exp" value="false"/>
        <parameter key="use_log" value="false"/>
        <parameter key="use_absolute_values" value="false"/>
        <parameter key="use_sgn" value="false"/>
        <parameter key="use_min" value="false"/>
        <parameter key="use_max" value="false"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:cross_validation" compatibility="9.1.000" expanded="true" height="145" name="Cross Validation (2)" width="90" x="313" y="85">
            <parameter key="split_on_batch_attribute" value="false"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_folds" value="10"/>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.1.000" expanded="true" height="103" name="Decision Tree (2)" width="90" x="179" y="85">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="maximal_depth" value="10"/>
                <parameter key="apply_pruning" value="true"/>
                <parameter key="confidence" value="0.1"/>
                <parameter key="apply_prepruning" value="true"/>
                <parameter key="minimal_gain" value="0.01"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
              <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="9.1.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="112" y="34">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
                <parameter key="main_criterion" value="first"/>
                <parameter key="accuracy" value="true"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="kappa" value="false"/>
                <parameter key="weighted_mean_recall" value="false"/>
                <parameter key="weighted_mean_precision" value="false"/>
                <parameter key="spearman_rho" value="false"/>
                <parameter key="kendall_tau" value="false"/>
                <parameter key="absolute_error" value="false"/>
                <parameter key="relative_error" value="false"/>
                <parameter key="relative_error_lenient" value="false"/>
                <parameter key="relative_error_strict" value="false"/>
                <parameter key="normalized_absolute_error" value="false"/>
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="root_relative_squared_error" value="false"/>
                <parameter key="squared_error" value="false"/>
                <parameter key="correlation" value="false"/>
                <parameter key="squared_correlation" value="false"/>
                <parameter key="cross-entropy" value="false"/>
                <parameter key="margin" value="false"/>
                <parameter key="soft_margin_loss" value="false"/>
                <parameter key="logistic_loss" value="false"/>
                <parameter key="skip_undefined_labels" value="true"/>
                <parameter key="use_example_weights" value="true"/>
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="remember" compatibility="9.1.000" expanded="true" height="68" name="Remember" width="90" x="447" y="136">
            <parameter key="name" value="performance"/>
            <parameter key="io_object" value="PerformanceVector"/>
            <parameter key="store_which" value="1"/>
            <parameter key="remove_from_process" value="true"/>
          </operator>
          <connect from_port="example set source" to_op="Cross Validation (2)" to_port="example set"/>
          <connect from_op="Cross Validation (2)" from_port="performance 1" to_op="Remember" to_port="store"/>
          <connect from_op="Remember" from_port="stored" to_port="performance sink"/>
          <portSpacing port="source_example set source" spacing="0"/>
          <portSpacing port="sink_performance sink" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="recall" compatibility="9.1.000" expanded="true" height="68" name="Recall" width="90" x="849" y="340">
        <parameter key="name" value="performance"/>
        <parameter key="io_object" value="PerformanceVector"/>
        <parameter key="remove_from_store" value="true"/>
      </operator>
      <connect from_op="Retrieve Titanic" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Optimize Selection (Evolutionary)" to_port="example set in"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Automatic Feature Engineering" to_port="example set in"/>
      <connect from_op="Optimize Selection (Evolutionary)" from_port="weights" to_port="result 2"/>
      <connect from_op="Optimize Selection (Evolutionary)" from_port="performance" to_port="result 1"/>
      <connect from_op="Automatic Feature Engineering" from_port="feature set" to_port="result 3"/>
      <connect from_op="Recall" from_port="result" to_port="result 4"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>


Answers

  • lionelderkrikor New Altair Community Member
    @IngoRM,

    Thank you for your time and your answers.

    Regards,

    Lionel
  • IngoRM New Altair Community Member
    Hi,
    Ok, we have looked into this again. It turned out that these were in fact two different issues after all. One was a problem with the ordering of the individuals in the Pareto front, which in certain circumstances could lead to a shifted selection of individuals (most visible in the results of AM, which is why I will comment on that in the other thread in a minute: https://community.rapidminer.com/discussion/54284/rm-9-1-feedback-lets-talk-of-the-new-automatic-feature-engineering-fs#latest).
    The other issue is the "wrong" selection based on the bias. The reason is quite simple: you used "accuracy" as the main criterion in your process, but the AFE operator requires the inner performance to deliver an error rate, i.e. something that is minimized, not maximized. Although this was stated in the operator's documentation, it was definitely a bit hidden, and we have improved the documentation on that point.
    Both issues together led to the behavior you observed.
    Thanks again for pointing these things out. The fix for the shifting bug and the updated documentation will both be part of the next release (the beta starts soon).
    Best,
    Ingo
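
To illustrate the second issue Ingo describes (the AFE operator expects an error rate that it can minimize), here is a minimal sketch of how the inner "Performance (2)" operator from the process above might be reconfigured. It assumes that "classification_error" is an accepted value for main_criterion and that the parameters left out keep their defaults. Treat it as an illustration only, not as the official fix.

```xml
<!-- Sketch only: inner performance operator of the AFE subprocess, switched to an error rate -->
<operator activated="true" class="performance_classification" compatibility="9.1.000" expanded="true" height="82" name="Performance (2)" width="90" x="246" y="34">
  <!-- Assumption: "classification_error" is a valid main_criterion value -->
  <parameter key="main_criterion" value="classification_error"/>
  <!-- Accuracy is switched off so that the only delivered criterion is one to be minimized -->
  <parameter key="accuracy" value="false"/>
  <parameter key="classification_error" value="true"/>
  <parameter key="skip_undefined_labels" value="true"/>
  <parameter key="use_example_weights" value="true"/>
  <list key="class_weights"/>
</operator>
```

With this change, the performance that Cross Validation (2) passes on inside AFE would be an error rate rather than an accuracy, which is what the answer above says the operator requires.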