XValidation reporting; optimization for specific performance?

sj721
edited November 5 in Community Q&A
I'm relatively new to RapidMiner and have two questions:

question 1
I'm using an optimization block, and a cross validated decision tree within. In the final output performance, under accuracy, it seems to be counting all samples. I'm interested in measuring performance on the test set only.

question 2
What's the optimizer's target? It seems to be working on a blackbox average of performance. I'm hoping to optimize for recall and precision, in that order.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="467" width="614">
      <operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="75">
        <parameter key="csv_file" value="/home/sam/Documents/COMP764/ads/processors/000.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="UTF-8"/>
        <list key="data_set_meta_data_information">
          <parameter key="1" value="text_ads.true.integer.attribute"/>
          <parameter key="2" value="image_ads.true.integer.attribute"/>
          <parameter key="3" value="flash_ads.true.integer.attribute"/>
          <parameter key="5" value="html_images.true.integer.attribute"/>
          <parameter key="6" value="html_images_non_small.true.integer.attribute"/>
          <parameter key="7" value="html_objects.true.integer.attribute"/>
          <parameter key="8" value="html_word_chars.true.integer.attribute"/>
          <parameter key="9" value="html_words.true.integer.attribute"/>
          <parameter key="10" value="html_words_unique.true.integer.attribute"/>
          <parameter key="11" value="num_popups.true.integer.attribute"/>
          <parameter key="12" value="label.true.binominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="normalize" compatibility="5.2.008" expanded="true" height="94" name="Normalize" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="value_type"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="313" y="75">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="5.2.008" expanded="true" height="112" name="Optimize Parameters (Grid)" width="90" x="45" y="300">
        <list key="parameters">
          <parameter key="Decision Tree.minimal_size_for_split" value="[3;10;10;linear]"/>
          <parameter key="Decision Tree.minimal_gain" value="[0.1;2;10;linear]"/>
          <parameter key="Decision Tree.maximal_depth" value="[5;30;10;linear]"/>
        </list>
        <process expanded="true" height="521" width="760">
          <operator activated="true" class="x_validation" compatibility="5.2.008" expanded="true" height="112" name="Validation (2)" width="90" x="179" y="120">
            <parameter key="number_of_validations" value="3"/>
            <parameter key="sampling_type" value="shuffled sampling"/>
            <process expanded="true" height="521" width="355">
              <operator activated="false" class="support_vector_machine" compatibility="5.2.008" expanded="true" height="112" name="SVM" width="90" x="45" y="30"/>
              <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="165">
                <parameter key="minimal_size_for_split" value="10"/>
                <parameter key="minimal_gain" value="2.0"/>
                <parameter key="maximal_depth" value="30"/>
              </operator>
              <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="521" width="355">
              <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="false" class="performance_classification" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="45" y="165">
                <parameter key="main_criterion" value="accuracy"/>
                <parameter key="weighted_mean_recall" value="true"/>
                <parameter key="weighted_mean_precision" value="true"/>
                <list key="class_weights"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.2.008" expanded="true" height="76" name="Performance (2)" width="90" x="179" y="165"/>
              <operator activated="true" class="log" compatibility="5.2.008" expanded="true" height="76" name="Log" width="90" x="179" y="30">
                <list key="log">
                  <parameter key="min split size" value="operator.Decision Tree.parameter.minimal_size_for_split"/>
                  <parameter key="min gain" value="operator.Decision Tree.parameter.minimal_gain"/>
                  <parameter key="max depth" value="operator.Decision Tree.parameter.maximal_depth"/>
                  <parameter key="performance" value="operator.Validation (2).value.performance"/>
                </list>
              </operator>
              <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
              <connect from_op="Performance (2)" from_port="performance" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Validation (2)" to_port="training"/>
          <connect from_op="Validation (2)" from_port="model" to_port="result 1"/>
          <connect from_op="Validation (2)" from_port="averagable 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 3"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>



thanks

Answers

  • Skirzynski
    sj721 wrote:

    question 1
    I'm using an optimization block, and a cross validated decision tree within. In the final output performance, under accuracy, it seems to be counting all samples. I'm interested in measuring performance on the test set only.
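    The good thing about cross-validation is that every example ends up in a test set exactly once. The reported performance therefore covers all examples, but each prediction comes from a model that never saw that example during training. So the accuracy you see is already a test-set measurement, not a training-set one.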
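To make the point above concrete, here is a minimal Python sketch (not RapidMiner code; the function names `k_fold_indices` and `out_of_fold_counts` are made up for illustration). It deals 10 example indices into 3 folds, as a shuffled 3-fold validation would, and counts how often each example is tested and how often it is trained on.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint test folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so the folds never overlap.
    return [indices[i::k] for i in range(k)]

def out_of_fold_counts(n_examples, k, seed=0):
    """Count how often each example is in a test fold and in a training set."""
    test_count = [0] * n_examples
    train_count = [0] * n_examples
    for test_fold in k_fold_indices(n_examples, k, seed):
        held_out = set(test_fold)
        for i in range(n_examples):
            if i in held_out:
                test_count[i] += 1   # predicted by a model that did not see it
            else:
                train_count[i] += 1  # used to fit the model of this fold
    return test_count, train_count

test_count, train_count = out_of_fold_counts(n_examples=10, k=3)
print(test_count)   # every entry is 1
print(train_count)  # every entry is 2, i.e. k - 1
```

Because each example is tested exactly once, the pooled confusion matrix sums to the full data set size even though no prediction was made on training data.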
    sj721 wrote:

    question 2
    What's the optimizer's target? It seems to be working on a blackbox average of performance. I'm hoping to optimize for recall and precision, in that order.
    This depends on the performance operator you are using. The simple "Performance" operator in your process has no parameters for this (I think it uses accuracy). If you use the "Performance (Classification)" operator instead (it is disabled in your process), you can select the "main criterion", which is the value used when two parameter settings are compared. You can also combine several criteria with the "Combine Performances" operator and set the importance of each criterion with its criterion weight.
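The two options above, a single main criterion versus a weighted combination, can pick different parameter settings. The Python sketch below illustrates the idea only; it is not how RapidMiner is implemented, the `Result` type and both helper functions are hypothetical, and the recall and precision numbers are made up. The first helper ranks by recall and uses precision only to break ties, which is one reading of "recall and precision, in that order".

```python
from typing import NamedTuple

class Result(NamedTuple):
    params: dict
    recall: float
    precision: float

def best_by_main_criterion(results):
    """Highest recall wins; precision is consulted only on a recall tie."""
    return max(results, key=lambda r: (r.recall, r.precision))

def best_by_weighted_sum(results, recall_weight, precision_weight):
    """Highest weighted sum of the two criteria wins."""
    return max(
        results,
        key=lambda r: recall_weight * r.recall + precision_weight * r.precision,
    )

# Made-up grid results for three settings of maximal_depth.
results = [
    Result({"maximal_depth": 5}, recall=0.80, precision=0.90),
    Result({"maximal_depth": 10}, recall=0.85, precision=0.60),
    Result({"maximal_depth": 20}, recall=0.85, precision=0.70),
]

by_main = best_by_main_criterion(results)
by_sum = best_by_weighted_sum(results, recall_weight=1.0, precision_weight=1.0)
print(by_main.params)  # {'maximal_depth': 20}
print(by_sum.params)   # {'maximal_depth': 5}
```

With equal weights the combined score favours the setting with lower recall but much higher precision, so if recall really comes first, a main criterion (or a much larger recall weight) matches that intent more closely.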