Decision tree and RapidMiner performance measures - how to understand them

Picia
Picia New Altair Community Member
edited November 2024 in Community Q&A
I would like to ask for help in the following matter.
In a decision tree built with gain ratio, every instance is simply assigned to a class; in my case, one of two classes.
I do not understand how RMSE is calculated, given that this measure is based on the difference between the actual value and the predicted value. If my classes are encoded as 0 and 1, does that mean the difference between the actual and the predicted value is always either 0 or 1?
Similarly, I do not understand the definition of the margin. The margin is defined as the minimal confidence for the correct label. Should I calculate the confidence for all the nodes and take the minimum value?
Finally, I do not understand the soft margin loss. It is defined as the average, over all examples, of 1 - confidence for the correct label. How do I calculate 1 - confidence for the correct label?


Best Answer

  • YYH
    YYH
    Altair Employee
    Answer ✓
    Hi @Picia,

    Thanks for the follow-up and clarifications. Yes, gain ratio is the right criterion to use for classification trees.

    If you are interested in how RMSE, margin, and soft margin loss are calculated for classification performance, here are the open-source Java source files behind them:

    https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/RootMeanSquaredError.java

    https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/Margin.java

    https://github.com/rapidminer/rapidminer-studio/blob/master/src/main/java/com/rapidminer/operator/performance/SoftMarginLoss.java
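
As a rough illustration of the two definitions quoted in the question (margin = minimal confidence for the correct label; soft margin loss = average of 1 - confidence for the correct label), here is a minimal Python sketch. It follows those textual definitions only and has not been checked against the linked Java sources; the function names and the sample confidences are hypothetical.

```python
from typing import Sequence


def margin(true_confidences: Sequence[float]) -> float:
    """Smallest confidence the model gave to the correct label of any example."""
    return min(true_confidences)


def soft_margin_loss(true_confidences: Sequence[float]) -> float:
    """Average of (1 - confidence for the correct label) over all examples."""
    return sum(1.0 - c for c in true_confidences) / len(true_confidences)


# Hypothetical data: for each example, the confidence assigned to its TRUE class.
conf_true = [0.9, 0.75, 0.25, 1.0]

m = margin(conf_true)              # 0.25, the worst-supported correct label
sml = soft_margin_loss(conf_true)  # approximately 0.275
print(m, sml)
```

Under these definitions both measures are computed over the examples, not over the nodes of the tree: each example contributes the confidence its prediction assigns to the correct label.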

    Attached is an example process that manually calculates RMSE for the training performance.

    Simply put, the squared error (SE), i.e. the "gap" between the real value (yes or no) and the prediction confidence, is computed for each instance in the "Generate Attributes" operator as (1 - confidence of the true class)^2.
    We then obtain the MSE (mean squared error) by extracting the average of SE.
    Finally, RMSE is the square root of MSE:
    RMSE = sqrt(MSE), where MSE = (sum of squared errors) / N and N is the number of examples.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.5.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="85">
            <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.4.000" expanded="true" height="103" name="Decision Tree" origin="GENERATED_TUTORIAL" width="90" x="447" y="85">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="20"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.25"/>
            <parameter key="apply_prepruning" value="true"/>
            <parameter key="minimal_gain" value="0.1"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.5.001" expanded="true" height="82" name="Apply Model" width="90" x="648" y="85">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_classification" compatibility="9.5.001" expanded="true" height="82" name="Performance" width="90" x="849" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="accuracy" value="true"/>
            <parameter key="classification_error" value="false"/>
            <parameter key="kappa" value="false"/>
            <parameter key="weighted_mean_recall" value="false"/>
            <parameter key="weighted_mean_precision" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="absolute_error" value="false"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="true"/>
            <parameter key="squared_correlation" value="true"/>
            <parameter key="cross-entropy" value="true"/>
            <parameter key="margin" value="true"/>
            <parameter key="soft_margin_loss" value="true"/>
            <parameter key="logistic_loss" value="true"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
            <list key="class_weights"/>
          </operator>
          <operator activated="true" breakpoints="after" class="generate_attributes" compatibility="9.5.001" expanded="true" height="82" name="Generate Attributes" width="90" x="983" y="136">
            <list key="function_descriptions">
              <parameter key="SE" value="if(Survived==&quot;Yes&quot;,(1-[confidence(Yes)])^2,(1-[confidence(No)])^2)"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" breakpoints="after" class="extract_macro" compatibility="9.5.001" expanded="true" height="68" name="Extract Macro" width="90" x="1184" y="136">
            <parameter key="macro" value="MSE"/>
            <parameter key="macro_type" value="statistics"/>
            <parameter key="statistics" value="average"/>
            <parameter key="attribute_name" value="SE"/>
            <list key="additional_macros"/>
            <description align="center" color="transparent" colored="false" width="126">MSE is the sum of squared error / n</description>
          </operator>
          <operator activated="true" class="generate_macro" compatibility="9.5.001" expanded="true" height="82" name="Generate Macro (2)" width="90" x="1318" y="136">
            <list key="function_descriptions">
              <parameter key="RMSE" value="sqrt(eval(%{MSE}))"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Calculate RMSE as the square root of (SSE / number of examples)</description>
          </operator>
          <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro (2)" to_port="through 1"/>
          <connect from_op="Generate Macro (2)" from_port="through 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="21"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
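
The same calculation as the process above can be sketched in a few lines of Python. This mirrors the "Generate Attributes" expression (squared error = (1 - confidence of the true class)^2), followed by the average and the square root. The function name and the sample labels and confidences are hypothetical; only the formula comes from the process.

```python
import math
from typing import Sequence


def rmse_from_confidences(
    labels: Sequence[str],
    conf_yes: Sequence[float],
    conf_no: Sequence[float],
) -> float:
    """RMSE where each error is 1 minus the confidence of the true class."""
    squared_errors = []
    for label, c_yes, c_no in zip(labels, conf_yes, conf_no):
        # Same branch as: if(Survived=="Yes", (1-conf(Yes))^2, (1-conf(No))^2)
        conf_true = c_yes if label == "Yes" else c_no
        squared_errors.append((1.0 - conf_true) ** 2)
    mse = sum(squared_errors) / len(squared_errors)  # mean squared error
    return math.sqrt(mse)


# Hypothetical labelled data with per-class confidences from the model.
labels = ["Yes", "No", "Yes", "No"]
conf_yes = [0.8, 0.2, 0.4, 0.5]
conf_no = [0.2, 0.8, 0.6, 0.5]

rmse = rmse_from_confidences(labels, conf_yes, conf_no)
print(rmse)
```

Because the error is taken against the confidence rather than against the hard 0/1 prediction, the per-example error is usually a fraction, not just 0 or 1.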
    





    HTH!

    YY



Answers

  • YYH
    YYH
    Altair Employee
    edited February 2020
    Hi @Picia,

    If you have a binary label (0 or 1) to predict with a Decision Tree, the best approach is to convert the target type from numeric to nominal and apply the "Performance (Binominal Classification)" operator to obtain the measures for classification models.
    AUC, classification error, accuracy, recall, F-measure, etc. are the metrics usually used for binominal classification.



    RMSE, which appears in your example, is an error metric commonly used to measure the performance of regression models. I am not sure how margin and soft margin loss are defined in the "Performance (Classification)" operator. I will double-check with the internal team and update later.
    For reference, the log loss is commonly used in classification because it also takes the confidence values into account. With yt the true label (0 or 1) and yp the predicted probability of class 1, it is defined as:
    -log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))
    https://www.quora.com/What-is-an-intuitive-explanation-for-the-log-loss-function
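
The log loss formula above can be tried out with a short Python sketch. Two things here are assumptions rather than part of the formula: the per-example losses are averaged over the data set, and probabilities are clipped away from 0 and 1 (the `eps` parameter) so that the logarithm stays finite. The function name is hypothetical.

```python
import math
from typing import Sequence


def log_loss(y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-15) -> float:
    """Average of -(yt*log(yp) + (1-yt)*log(1-yp)) over all examples."""
    total = 0.0
    for yt, yp in zip(y_true, y_prob):
        # Assumption: clip the probability so log(0) never occurs.
        yp = min(max(yp, eps), 1.0 - eps)
        total += -(yt * math.log(yp) + (1 - yt) * math.log(1.0 - yp))
    return total / len(y_true)


# A confident correct prediction costs little, a confident wrong one costs a lot.
good = log_loss([1, 0], [0.9, 0.1])
bad = log_loss([1, 0], [0.1, 0.9])
print(good, bad)
```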

    Cheers,
    YY
  • Picia
    Picia New Altair Community Member
    edited February 2020
    I did that. My question is about how, technically, the performance measures are calculated.
    I have a decision tree that simply classifies instances. I am using gain ratio, so I do not think it is a regression tree.
    How do I calculate the predicted value, and then how do I calculate the difference between the predicted value and the actual value?
    And how do I calculate the margin and the soft margin?
    In a decision tree I see no probabilities associated with an individual instance. The tree simply assigns each instance to some class. So what is the predicted value? What is the margin: some minimum value of confidence over all the nodes in the tree?
    I am using gain ratio to create the tree, but only to choose the split criteria in the nodes (or am I wrong, and it is somehow also used to determine the margin or the predicted value?).

  • Picia
    Picia New Altair Community Member
    This is the model I am using. Inside the cross validation is a decision tree. I split the data sample and use a separate sample for a secondary validation of the trained decision tree on completely unseen instances.
  • Picia
    Picia New Altair Community Member
    And here are the performance measures returned by the Performance(2) element. I do not understand how they are calculated, because I am using a binominal decision tree, not a regression tree, and I have no idea how the Performance(2) module calculates RMSE and the other measures.

  • Picia
    Picia New Altair Community Member
    This is what I have inside the Cross-Validation module. I am training a binominal decision tree. The Performance module I have here returns only accuracy, precision and recall, which makes sense. I have no idea why the other module, Performance(2), returns performance measures that are suitable for regression trees but not for a binominal tree, or how it is even possible to calculate them.

  • Picia
    Picia New Altair Community Member
    Here is what I see when I hover the mouse over the "per" entry in the Performance(2) module.
  • Picia
    Picia New Altair Community Member
    edited February 2020
    So if I understand it right, these are the definitions for the performance parameters for the binominal tree?


  • Picia
    Picia New Altair Community Member
    edited February 2020
    I found the setter and getter for confidence in the Example class.
    However, if I understand it correctly, the Example class represents only one instance from the data set. So every instance has its own confidence value.
    I do not know how it is calculated for each instance. In the decision tree I can set the confidence level (probably this is the z value from the normal distribution, used to calculate the confidence for pruning). But if every instance has its own confidence, then I do not know how it is calculated.
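
One common convention for decision trees, which may well be what happens here but has not been verified against the RapidMiner source, is that the per-instance confidence for a class is the fraction of training examples of that class in the leaf the instance falls into. The pruning "confidence" parameter of the operator is a separate setting. A minimal Python sketch of that convention, with a hypothetical function name and hypothetical leaf contents:

```python
from collections import Counter
from typing import Dict, Sequence


def leaf_confidences(leaf_labels: Sequence[str]) -> Dict[str, float]:
    """Class confidences for a leaf: relative frequency of each training label in it."""
    counts = Counter(leaf_labels)
    total = len(leaf_labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical leaf that received 3 "Yes" and 1 "No" training examples.
leaf = ["Yes", "Yes", "Yes", "No"]
conf = leaf_confidences(leaf)
print(conf)  # {'Yes': 0.75, 'No': 0.25}
```

Under this assumption, every instance routed to that leaf is predicted as "Yes" with confidence(Yes) = 0.75 and confidence(No) = 0.25, which is why instances can share a prediction but still contribute fractional errors to RMSE, margin, and soft margin loss.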