Problems discovering which data could not be correctly classified.

the_duckman New Altair Community Member
edited November 5 in Community Q&A
G'Day all,

I was trying something quite straightforward and it just won't work for me.

As per the title, I have some data on which I trained a classifier, and I wish to better understand my situation by examining the records which failed to classify. To (attempt to) do this, I re-apply the trained model to the original training data and use the "Filter Examples" operator to select the errant classifications. The problem is that I always get 0 examples returned in the final set.

I've spent the whole day trying to discover what I am doing wrong without any progress and could really use some assistance.

Below is a typical example (adapted to use sample data) of my attempts to do this using the "Filter Examples" operator.
On both of my machines this fails to discover the errant classification set.

Thanks for any help or ideas,

-dm

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" expanded="true" name="Process">
   <process expanded="true" height="424" width="681">
     <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="75">
       <parameter key="repository_entry" value="//Samples/data/Sonar"/>
     </operator>
     <operator activated="true" class="select_by_random" expanded="true" height="76" name="Select by Random" width="90" x="45" y="165">
       <parameter key="use_fixed_number_of_attributes" value="true"/>
       <parameter key="number_of_attributes" value="20"/>
     </operator>
     <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="112" y="255"/>
     <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
       <process expanded="true" height="442" width="268">
         <operator activated="true" class="k_nn" expanded="true" height="76" name="k-NN (2)" width="90" x="112" y="75"/>
         <connect from_port="training" to_op="k-NN (2)" to_port="training set"/>
         <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
         <portSpacing port="source_training" spacing="36"/>
         <portSpacing port="sink_model" spacing="0"/>
         <portSpacing port="sink_through 1" spacing="0"/>
       </process>
       <process expanded="true" height="442" width="279">
         <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
           <list key="application_parameters"/>
         </operator>
         <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="112" y="165"/>
         <connect from_port="model" to_op="Apply Model" to_port="model"/>
         <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
         <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
         <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
         <portSpacing port="source_model" spacing="0"/>
         <portSpacing port="source_test set" spacing="0"/>
         <portSpacing port="source_through 1" spacing="0"/>
         <portSpacing port="sink_averagable 1" spacing="0"/>
         <portSpacing port="sink_averagable 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="380" y="255">
       <list key="application_parameters"/>
     </operator>
     <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="165">
       <parameter key="condition_class" value="wrong_predictions"/>
     </operator>
     <connect from_op="Retrieve (2)" from_port="output" to_op="Select by Random" to_port="example set input"/>
     <connect from_op="Select by Random" from_port="example set output" to_op="Multiply" to_port="input"/>
     <connect from_op="Multiply" from_port="output 1" to_op="Validation" to_port="training"/>
     <connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
     <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
     <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
     <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Filter Examples" to_port="example set input"/>
     <connect from_op="Filter Examples" from_port="example set output" to_port="result 2"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • haddock
    haddock New Altair Community Member
    G'Day!

    Actually, there is nothing wrong with your code! You've applied the model to its own training data, so, surprise surprise, there are no errors. I've stuck a breakpoint in at the crucial juncture to make the point:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="424" width="681">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="75">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="select_by_random" expanded="true" height="76" name="Select by Random" width="90" x="45" y="165">
            <parameter key="use_fixed_number_of_attributes" value="true"/>
            <parameter key="number_of_attributes" value="20"/>
          </operator>
          <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="112" y="255"/>
          <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
            <process expanded="true" height="442" width="268">
              <operator activated="true" class="k_nn" expanded="true" height="76" name="k-NN (2)" width="90" x="112" y="75"/>
              <connect from_port="training" to_op="k-NN (2)" to_port="training set"/>
              <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="36"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="442" width="279">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="112" y="165"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" breakpoints="after" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="380" y="255">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="165">
            <parameter key="condition_class" value="wrong_predictions"/>
          </operator>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Select by Random" to_port="example set input"/>
          <connect from_op="Select by Random" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Validation" to_port="training"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    In the following I've split the data; errors start creeping in and being flagged:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="424" width="681">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Sonar"/>
          </operator>
          <operator activated="true" class="split_data" expanded="true" height="94" name="Split Data" width="90" x="69" y="115">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.5"/>
              <parameter key="ratio" value="0.5"/>
            </enumeration>
          </operator>
          <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="75">
            <process expanded="true" height="442" width="268">
              <operator activated="true" class="k_nn" expanded="true" height="76" name="k-NN (2)" width="90" x="112" y="75"/>
              <connect from_port="training" to_op="k-NN (2)" to_port="training set"/>
              <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="36"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="442" width="279">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="112" y="165"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" breakpoints="after" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="380" y="255">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="165">
            <parameter key="condition_class" value="wrong_predictions"/>
          </operator>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Validation" to_port="training"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Hope that clears the fog!
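
For anyone following along outside RapidMiner, the same effect can be reproduced in a few lines of Python. This is only a minimal sketch on made-up one-dimensional data (not the Sonar set), assuming k = 1; `predict_1nn` and `wrong_predictions` are hypothetical helper names standing in for the k-NN model and the "wrong_predictions" filter condition.

```python
# Toy illustration with invented data: a 1-nearest-neighbour model stores its
# training examples, so every training example is its own nearest neighbour.

def predict_1nn(train, x):
    """Return the label of the stored training example closest to x."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]

def wrong_predictions(train, examples):
    """Keep only the examples whose predicted label differs from the true label."""
    return [(x, label) for x, label in examples if predict_1nn(train, x) != label]

# (feature, label) pairs; the two classes overlap slightly on purpose.
train = [(1.0, "Rock"), (2.0, "Rock"), (5.0, "Mine"), (6.0, "Mine")]
held_out = [(1.4, "Rock"), (2.5, "Mine"), (4.8, "Mine"), (5.7, "Rock")]

on_training = wrong_predictions(train, train)     # model applied to its own training data
on_held_out = wrong_predictions(train, held_out)  # model applied to unseen data

print("wrong on training data:", on_training)
print("wrong on held-out data:", on_held_out)
```

Applied to its own training data the filter comes back empty, while the held-out half produces misclassified rows to inspect, which mirrors the difference between the two processes above.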

  • the_duckman
    the_duckman New Altair Community Member
    Cheers,

    The second bit of code works, but it's confusing me.
    (I was not able to run the first bit of code; it caused the software to crash.)

    What I find confusing is that the model's performance vector shows 79%. If it was trained under cross-validation, how can it have learnt the training data to 100%?

    I think I am missing something here; any clarification would sure be appreciated.

    -dm
  • the_duckman
    the_duckman New Altair Community Member
    OK, I spent today reading up on how k-NN training works,
    and found out that the whole training dataset is stored by the classifier.
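
The gap between the cross-validated performance figure and the error-free re-application can be sketched the same way. The Python below is a rough illustration on invented data, assuming k = 1 and assuming that the model handed back after validation is trained on all of the input data (worth confirming against the X-Validation operator's documentation); the function names are hypothetical.

```python
# Toy comparison: cross-validated accuracy versus accuracy on the training data
# itself (resubstitution) for a 1-nearest-neighbour classifier.

def predict_1nn(train, x):
    """Return the label of the stored training example closest to x."""
    nearest = min(train, key=lambda example: abs(example[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    hits = sum(1 for x, label in test if predict_1nn(train, x) == label)
    return hits / len(test)

def cross_validated_accuracy(data, folds):
    """Average accuracy over folds; each fold is tested on examples it never stored."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [ex for j, ex in enumerate(data) if j % folds != i]
        scores.append(accuracy(train, test))
    return sum(scores) / folds

# Invented (feature, label) pairs with some class overlap.
data = [
    (1.0, "Rock"), (1.4, "Rock"), (2.0, "Rock"), (2.5, "Mine"),
    (4.8, "Mine"), (5.0, "Mine"), (5.7, "Rock"), (6.0, "Mine"),
]

cross_validated = cross_validated_accuracy(data, folds=4)
resubstitution = accuracy(data, data)  # final model applied to the data it stored

print("cross-validated accuracy:", cross_validated)
print("accuracy on own training data:", resubstitution)
```

The cross-validated figure is an estimate for unseen data, so it sits below the score obtained by re-applying the stored examples to themselves; the two numbers answer different questions.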

    It's all clear now and I am under way again. Thanks a lot for the help, Haddock.

    -dm
  • haddock
    haddock New Altair Community Member
    Greets Duckman,

    Cool, well done for thinking it through; definitely the way to go. Get back when you're next underwhelmed :D