Inconsistent metadata after the Apply Model operator?

When I check the meta data in the data flow after the Apply Model operator, it appears to be inconsistent with the actual data. More precisely, in one of my experiments I expected
three more columns to be listed in the meta data: the confidences for the classes
Yes and No, and the prediction itself. None of them was included in the meta data, although all were present in the
scored dataset. In particular, the two confidences were not visible to the Select Attributes operator, which I intended to use to drop them before storing the scored dataset in a database.

Any comments are welcome.

Best
Dan

haddock

It doesn't happen with the following code - so perhaps you should post yours.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="">
    <process expanded="true" height="365" width="748">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve" width="90" x="179" y="75">
        <parameter key="repository_entry" value="//Samples/data/Golf"/>
      </operator>
      <operator activated="true" class="loop_attribute_subsets" compatibility="5.0.8" expanded="true" height="60" name="Loop Subsets" width="90" x="380" y="75">
        <parameter key="use_exact_number" value="true"/>
        <parameter key="exact_number_of_attributes" value="2"/>
        <process expanded="true" height="380" width="815">
          <operator activated="true" class="x_validation" compatibility="5.0.8" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
            <process expanded="true" height="380" width="438">
              <operator activated="true" class="decision_tree" compatibility="5.0.8" expanded="true" height="76" name="Decision Tree" width="90" x="95" y="144"/>
              <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="380" width="438">
              <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="25" y="47">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="5.0.8" expanded="true" height="76" name="Performance" width="90" x="231" y="38">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="5.0.8" expanded="true" height="76" name="Log" width="90" x="581" y="30">
            <list key="log">
              <parameter key="Attributes" value="operator.Loop Subsets.value.feature_names"/>
              <parameter key="Performance" value="operator.Validation.value.performance"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <portSpacing port="source_example set" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="log_to_data" compatibility="5.0.8" expanded="true" height="94" name="Log to Data" width="90" x="571" y="73"/>
      <connect from_op="Retrieve" from_port="output" to_op="Loop Subsets" to_port="example set"/>
      <connect from_op="Loop Subsets" from_port="example set" to_op="Log to Data" to_port="through 1"/>
      <connect from_op="Log to Data" from_port="exampleSet" to_port="result 1"/>
      <connect from_op="Log to Data" from_port="through 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

land

Hi,
once again I have to point out that the meta data transformation is only a dry run that does not look at the actual data. If the label attribute's values are not known during meta data transformation, the corresponding attributes cannot be inserted during the dry model application. The model's meta data simply does not know which class values the model was built on.

Greetings,
Sebastian

dan_agape

Both, thanks.

For the sake of simplicity, assume you learn a decision tree from the Iris dataset and write the model to a file. You then want to use it to score the Iris dataset and drop the soft prediction (confidence) attributes from the scored dataset via the Select Attributes operator, as shown below.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <parameter key="logverbosity" value="3"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="1"/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true" height="550" width="480">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="165">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="read_model" compatibility="5.0.8" expanded="true" height="60" name="Read Model" width="90" x="45" y="75">
        <parameter key="model_file" value="C:\Users\dan\Desktop\model_DT.mod"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="179" y="75">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.8" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="75">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="0"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="11"/>
        <parameter key="block_type" value="0"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="8"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Read Model" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="36"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

How else can one drop the confidence attributes, given that they and the prediction attribute are not visible to the Select Attributes operator, which makes it practically useless here?

Obviously, if the canvas included the process that built the model instead of reading the model from a file, the meta data would contain these confidence attributes, so they would be visible and could be discarded with the Select Attributes operator (before saving the scored dataset). But the model is only available in a file (in a real application, for instance, the model may have been built by running several experiments until it was satisfactory).

Thanks
Dan

haddock

How can one alternatively drop the confidence attributes, as it seems that these and the prediction attribute are not visible by the select attribute operator, which is practically not useful here?

Really? Try reading the help..

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="550" width="547">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="165">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="k_nn" compatibility="5.0.8" expanded="true" height="76" name="k-NN" width="90" x="173" y="155"/>
      <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="313" y="75">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.8" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="210">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="regular_expression" value="pred.*"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <connect from_op="Retrieve (2)" from_port="output" to_op="k-NN" to_port="training set"/>
      <connect from_op="k-NN" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="k-NN" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="36"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
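
Note that the pattern pred.* in the process above removes the prediction attribute. A minimal sketch of the same idea aimed at the confidence columns instead, assuming Apply Model names them confidence(CLASS), for example confidence(Iris-setosa). The regular expression is an illustration and does not come from the posts in this thread. Because the filter works on names at run time, it should not depend on the confidences showing up in the meta data.

```xml
<!-- Sketch only: drop the confidence columns by name pattern.
     Assumption: Apply Model names them confidence(CLASS). -->
<operator activated="true" class="select_attributes" compatibility="5.0.8" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="210">
  <parameter key="attribute_filter_type" value="regular_expression"/>
  <!-- Matches confidence(Iris-setosa) etc.; prediction(label) is not matched. -->
  <parameter key="regular_expression" value="confidence.*"/>
  <!-- Inverted selection: keep everything except the matches. -->
  <parameter key="invert_selection" value="true"/>
  <!-- Confidences are special attributes, so the filter has to consider them. -->
  <parameter key="include_special_attributes" value="true"/>
</operator>
```

This operator would take the place of the Select Attributes operator in either process above; the connections stay the same.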

land

Hi Dan,
if you saved the model to the repository instead of a file, the meta data would be preserved. That's exactly why we introduced the repository in the first place...
Furthermore, if you are going to put a process into production, you won't even want to start RapidMiner to get the process results. We designed RapidAnalytics to solve this issue: you can run processes there manually, on a time schedule, or expose them as a web service. The latter gives you a very easy way of integrating RM into your IT infrastructure. Several formats for delivering the results are supported, including XML, JSON, or delivering the plot directly.
For all of this, RapidAnalytics provides a so-called "remote repository" that can be accessed by all team members and supports user rights. Because of that, we regard exporting things into files as the old-fashioned, absolute baseline solution that is no longer the preferred way.

And if you really want to have control over the files, take a look at the local repository directory: the content is simply stored in files, but the repository connects it with the meta data.
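
As a rough sketch of the repository route, Dan's scoring process could swap Read Model for a second Retrieve. This assumes the training process wrote the model with a Store operator (class "store", parameter repository_entry) to the entry //LocalRepository/models/model_DT, which is a hypothetical path chosen for illustration. Select Attributes is left unconfigured here.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="550" width="480">
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="165">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <!-- Replaces Read Model. The repository path is hypothetical. -->
      <operator activated="true" class="retrieve" compatibility="5.0.8" expanded="true" height="60" name="Retrieve Model" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//LocalRepository/models/model_DT"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="179" y="75">
        <list key="application_parameters"/>
      </operator>
      <!-- Left at its defaults; configure the filter once the attributes are listed. -->
      <operator activated="true" class="select_attributes" compatibility="5.0.8" expanded="true" height="76" name="Select Attributes" width="90" x="313" y="75"/>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Retrieve Model" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
    </process>
  </operator>
</process>
```

If the meta data is preserved as described, the prediction and confidence attributes should then be offered in the Select Attributes parameter dialog at design time.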

Greetings,
Sebastian

dan_agape

Sebastian, thanks, excellent and very helpful advice!

Best,
Dan