
"Bug in ModelApplier?"

User: "wokon"
Intro: First of all, I would like to congratulate the Rapid-I team on this great piece of software. The user interface and the philosophy behind the data and operator handling are well designed and intuitive, and the set of algorithms and visualizations is very rich.

However, I stumbled upon quite a bug when I tried to solve the DMC 2007 challenge as an exercise with RapidMiner. It seems to me that something goes wrong in the ModelApplier when MetaCost is combined with certain datasets.

Bug: The ModelApplier seems to change the label headings of a dataset, which leads to completely different classification errors on the same data.

How to reproduce: Two small datasets, dmc2007_test_small.csv and dmc2007_test_sm_2.csv, are attached to this post. Each contains exactly the same set of 149 records; the only difference is that the order of the records is slightly rearranged: the labels read N…NBN…NA… in dmc2007_test_small.csv and N…NAN…NB… in dmc2007_test_sm_2.csv (only two lines are interchanged).
When you run dmc2007_test_small.csv through the following script, the number of B labels changes completely (from 11 to 23) when the data passes through the ModelApplier; see the attached screenshots in my_results.pdf, where the classification error goes from 30% to 26%. This does not happen with dmc2007_test_sm_2.csv; there everything is OK. The script is:

<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename" value="dmc2007_test_small.csv"/>
        <parameter key="id_column" value="1"/>
        <parameter key="label_column" value="22"/>
    </operator>
    <operator name="ModelLoader" class="ModelLoader">
        <parameter key="model_file" value="dmc2007-dt.mod"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="ClassificationPerformance" class="ClassificationPerformance">
        <list key="class_weights">
          <parameter key="N" value="1.0"/>
          <parameter key="A" value="999.0"/>
          <parameter key="B" value="1.0"/>
        </list>
        <parameter key="classification_error" value="true"/>
        <parameter key="keep_example_set" value="true"/>
    </operator>
</operator>
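The behavior above is consistent with the label index being assigned internally by order of first appearance in the example set: swapping two lines changes which index stands for "A" and which for "B", while the model's numeric predictions stay the same. Here is a minimal sketch of this suspected mechanism (plain Python, not RapidMiner code; the mapping rule is my assumption, not documented behavior):

```python
# Suspected mechanism: label indices are assigned by order of first appearance.
def index_mapping(labels):
    """Map each distinct label to an index in order of first occurrence."""
    mapping = {}
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
    return mapping

# Two files with the same records, only two lines interchanged (abbreviated):
file_small = ["N", "N", "B", "N", "N", "A"]   # dmc2007_test_small.csv order
file_sm_2  = ["N", "N", "A", "N", "N", "B"]   # dmc2007_test_sm_2.csv order

map_small = index_mapping(file_small)   # {'N': 0, 'B': 1, 'A': 2}
map_sm_2  = index_mapping(file_sm_2)    # {'N': 0, 'A': 1, 'B': 2}

# A model trained on data ordered like file_sm_2 predicts index 1 meaning "A",
# but decoding index 1 against map_small yields "B" -- the headings flip.
inverse_small = {i: lab for lab, i in map_small.items()}
print(inverse_small[1])  # "B", although the model meant "A"
```

If this is indeed what happens, it would explain why dmc2007_test_sm_2.csv (same label order as the training data) is unaffected.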
Remark: The model dmc2007-dt.mod can be trained with the script below. dmc2007_test_sm_2.csv has the same order of label appearance as the training dataset. Here is the training script:

<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource" breakpoints="after">
        <parameter key="filename" value="dmc2007_train_small.csv"/>
        <parameter key="id_column" value="1"/>
        <parameter key="label_column" value="22"/>
    </operator>
    <operator name="DecisionTree" class="DecisionTree">
        <parameter key="keep_example_set" value="true"/>
    </operator>
    <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="dmc2007-dt2.mod"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="ClassificationPerformance" class="ClassificationPerformance">
        <list key="class_weights">
          <parameter key="N" value="1.0"/>
          <parameter key="A" value="999.0"/>
          <parameter key="B" value="1.0"/>
        </list>
        <parameter key="classification_error" value="true"/>
        <parameter key="keep_example_set" value="true"/>
    </operator>
    <operator name="CostEvaluator" class="CostEvaluator">
        <parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
        <parameter key="keep_exampleSet" value="true"/>
    </operator>
</operator>

This seems somewhat disturbing to me, since the ModelApplier changes the incoming data (the "label" attribute), which it should only read.
And of course things can get much worse: if we put a record with label "B" as the first record of the dataset (again, the set is otherwise exactly the same), we get an apparent classification error of 86%. This is again due to the wrong labels; the model's predictions are exactly the same.

Recently I found out that the bug does not depend on the MetaCost part of the training model; the same thing happens if we just use a decision tree as the model.

Another topic: it is not clear to me how the rows and columns of the cost matrix map to the labels. At least I cannot find it in the documentation; by trial and error I found that the order of occurrence in the training set probably defines the rows. It would be nice to extend the cost matrix interface so that it is clear which dimension is the true label and which the predicted one (row or column?), and which line corresponds to which label.
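For reference, here is how I currently read the matrix from the training script, assuming rows are the true label, columns the predicted label, and both follow the order of first occurrence in the training set (N, A, B here). This convention is my guess from trial and error, not documented behavior:

```python
# Hypothetical reading of cost_matrix "[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]":
# assumption: rows = true label, columns = predicted label, label order N, A, B.
labels = ["N", "A", "B"]
cost = [
    [0.0,  0.0,  0.0],   # true N: any prediction costs nothing
    [1.0, -3.0,  1.0],   # true A: a correct hit yields -3 (a gain), a miss costs 1
    [1.0,  1.0, -6.0],   # true B: a correct hit yields -6, a miss costs 1
]

def total_cost(true_labels, predicted_labels):
    """Sum the matrix entries over all (true, predicted) pairs."""
    idx = {lab: i for i, lab in enumerate(labels)}
    return sum(cost[idx[t]][idx[p]] for t, p in zip(true_labels, predicted_labels))

# e.g. one correctly found B plus one A misclassified as N:
print(total_cost(["B", "A"], ["B", "N"]))  # -6.0 + 1.0 = -5.0
```

If the rows were in fact the predicted labels, the asymmetric entries would swap meaning, which is exactly why an explicit true/predicted labeling in the interface would help.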

I wish you all the best for your product; we are currently considering using it in some of our Master's and Bachelor's data mining courses.

Best regards

Wolfgang Konen

Institut für Informatik,
FH Köln - Campus Gummersbach
Steinmüllerallee 1
51643 Gummersbach
www.gm.fh-koeln.de/~konen

P.S.: Since no one responded to my bug report with ID 2686544 in SourceForge's Rapid-I bug tracker (March 13th), I am posting it here again. I have tried to put it in a more concise form so that the error is easier to see :). Just as a note: if you solve this, the bug with ID 2686544 is done as well. I hope to see some sort of reaction this time...

P.P.S.: If you do not maintain the bug tracker at SourceForge (which I can understand, you already have lots to do with the forum), it would perhaps be nice to put a note saying so at http://sourceforge.net/tracker/?group_id=114160&atid=667390 ;)

WK



[attachment deleted by admin]
