After using the suggested solution for a few weeks, I would like to re-express the need for a dedicated option for this (which is probably feasible since it is already available for other outputs of this operator).
Although the suggested solutions work and are simple enough, I found that they do not scale well to larger datasets. I tried running Explain Predictions on a dataset with ~80k rows and ~70 columns.
The Explain Predictions operator itself ran in a reasonable time, but the filter that keeps only the top 5 most important explaining attributes per example took far too long: after 14 hours it was only 25% done, so I terminated it. After further investigation, the "Append" operator seems to be the one taking the longest to execute under these conditions.
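For comparison, here is a minimal sketch of the same top-k filter done as a single vectorized operation rather than a per-example filter-and-append loop. The column names (`example_id`, `attribute`, `importance`) and the long-format layout are assumptions for illustration, not the operator's actual output schema:

```python
import pandas as pd

# Hypothetical long-format export of the explanation results:
# one row per (example, attribute) pair with an importance score.
df = pd.DataFrame({
    "example_id": [1, 1, 1, 2, 2, 2],
    "attribute":  ["a", "b", "c", "a", "b", "c"],
    "importance": [0.9, 0.1, 0.5, 0.2, 0.8, 0.7],
})

# Keep only the top-2 most important attributes per example in one
# sort + groupby pass, instead of filtering and appending per example.
top = (df.sort_values("importance", ascending=False)
         .groupby("example_id")
         .head(2))

print(top)
```

This kind of single-pass approach scales roughly with the total number of rows, which is why a dedicated option inside the operator would likely avoid the Append bottleneck described above.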
Also, some of our users have a Professional license, which is limited to 100,000 data rows. I assume this limit will apply in this case as well, which is a problem because the data that is actually kept in the end (for example, the top 5 explaining attributes for each prediction) would be below that limit.
This is probably not a situation that would occur with a production model (it will be used on smaller datasets covering at most a few weeks of recent data), but I'm trying to use this feature to investigate cases where the model was wrong in a dataset spanning several years. I could filter to keep only the examples where the model was the most wrong, but it's also useful to keep examples where the model was right, for comparison.
Thanks