Hi,
I just wanted to share some experience I gained with the different decision tree implementations in RapidMiner.
As an exercise I wanted to test RapidMiner on a realistically large dataset from the Data Mining Cup 2007,
http://www.data-mining-cup.de/2007/Wettbewerb (DMC2007), which has 50000 records for training and 50000 for testing.
I first worked with RapidMiner 4.2 and got pretty good results with MetaCost and DecisionTree: a misclassification cost of -0.141 on the training set. (The more negative, the better: the best of the roughly 300 participants in the DMC2007 challenge achieved -0.1578, while a very naive (bad) model that always predicts the most frequent class N would obtain 0.000.)
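(A short note on the scoring, in my reading of the cost matrix used further below: it is indexed with rows = predicted class and columns = true class, in the order N, A, B. A correct A prediction then earns -3, a correct B prediction earns -6, a wrong A or B prediction costs +1, and predicting N is always free, which is exactly why the trivial always-N model scores 0.000. The quoted figures are the total cost divided by the 50000 examples; a hypothetical model with 2000 correct A predictions, 1000 correct B predictions and 1000 wrong A/B guesses would score (2000*(-3) + 1000*(-6) + 1000*1) / 50000 = -0.22.)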
Then I switched to RapidMiner 4.4, since Steffen convinced me that it is better because of many other bug fixes (and it is). When rerunning the
same process (now with the revised DecisionTree of V4.4), it ran about 50x faster than under V4.2, but the model was also very "naive": misclassification cost = 0.000, because it always predicted "N", the trivial choice; all trees had only one leaf.
I guessed that the new pre-pruning might be the problem (with such a skewed class distribution, presumably no split passes the default pre-pruning threshold, so the root is never split) and activated the parameter no_pre_pruning. This led to a heap space error after 6 minutes.
I assumed that maximal_depth=20 might be too complex and decreased this parameter to 10. Now it worked: it took only 1 minute (compared to 12 minutes in V4.2) and produced a result. But at least for this dataset the result was quantitatively inferior to V4.2, with a misclassification cost of only -0.091.
I then replaced the DecisionTree operator with WEKA's W-REPTree operator, with parameter L=12, and right from the start got results better than all the others: misclassification cost = -0.147. (If I read the WEKA documentation correctly, L is REPTree's maximum tree depth, so this roughly corresponds to maximal_depth=12.) And W-REPTree is incredibly fast (about 5 seconds) and does not consume much heap space.
Since the datasets are a bit too large to be uploaded here, I provide them (500KB zip) under
http://www.gm.fh-koeln.de/~konen/DMDATA/dmc2007_train.zip, in case anyone wants to reproduce the results.
Here is the process I use (you may disable W-REPTree and enable DecisionTree to reproduce the last DT experiment):
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="2001"/>
    <operator name="MemoryCleanUp" class="MemoryCleanUp">
    </operator>
    <!-- load the 50000 training records -->
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="dmc2007_train.aml"/>
    </operator>
    <!-- Csum aggregates all attributes matching C1[0-9]* -->
    <operator name="AttributeAggregation" class="AttributeAggregation">
        <parameter key="attribute_name" value="Csum"/>
        <parameter key="aggregation_attributes" value="C1[0-9]*"/>
    </operator>
    <!-- Csum_special aggregates a hand-picked subset of the C-attributes -->
    <operator name="AttributeAggregation (2)" class="AttributeAggregation">
        <parameter key="attribute_name" value="Csum_special"/>
        <parameter key="aggregation_attributes" value="C1001[1-2]|C1000[3-4]|C10017"/>
    </operator>
    <!-- weight the attributes by information gain -->
    <operator name="InfoGainWeighting" class="InfoGainWeighting">
    </operator>
    <!-- save the weights so the same selection can be reused later -->
    <operator name="AttributeWeightsWriter" class="AttributeWeightsWriter">
        <parameter key="attribute_weights_file" value="attr01.wgt"/>
    </operator>
    <!-- keep only the 25 attributes with the highest weights -->
    <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
        <parameter key="keep_attribute_weights" value="true"/>
        <parameter key="weight" value="0.5"/>
        <parameter key="weight_relation" value="top k"/>
        <parameter key="k" value="25"/>
        <parameter key="use_absolute_weights" value="false"/>
    </operator>
    <!-- cost matrix for the DMC2007 task (class order N, A, B; see note above);
         each inner model is trained on a random 10% subset -->
    <operator name="MetaCost" class="MetaCost" expanded="yes">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
        <parameter key="use_subset_for_training" value="0.1"/>
        <!-- WEKA's REPTree; L = maximum tree depth -->
        <operator name="W-REPTree" class="W-REPTree">
            <parameter key="L" value="12.0"/>
        </operator>
        <!-- deactivated alternative: RapidMiner's own DecisionTree -->
        <operator name="DecisionTree" class="DecisionTree" activated="no">
            <parameter key="maximal_depth" value="10"/>
            <parameter key="no_pre_pruning" value="true"/>
        </operator>
    </operator>
    <!-- save the trained model -->
    <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="dmc2007-dt.mod"/>
    </operator>
    <!-- apply the model to the (training) data in the stream -->
    <operator name="ModelApplier" class="ModelApplier">
        <parameter key="keep_model" value="true"/>
        <list key="application_parameters">
        </list>
    </operator>
    <!-- plain classification error, for comparison -->
    <operator name="ClassificationPerformance" class="ClassificationPerformance">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="classification_error" value="true"/>
        <list key="class_weights">
            <parameter key="N" value="1.0"/>
            <parameter key="A" value="1.0"/>
            <parameter key="B" value="1.0"/>
        </list>
    </operator>
    <!-- computes the misclassification cost quoted above -->
    <operator name="CostEvaluator" class="CostEvaluator">
        <parameter key="keep_exampleSet" value="true"/>
        <parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
    </operator>
</operator>
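For completeness: the process above measures the cost on the training data itself. If you want to score the saved model on the separate test file, you could mirror the preprocessing and reload model and weights, roughly like this (only a sketch, untested; dmc2007_test.aml stands for your own attribute description of the test data, which must of course contain the true classes; operator names as I know them from 4.x):
<operator name="TestRoot" class="Process" expanded="yes">
    <!-- load the test records (dmc2007_test.aml is a placeholder) -->
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="dmc2007_test.aml"/>
    </operator>
    <!-- recreate the derived attributes exactly as in training -->
    <operator name="AttributeAggregation" class="AttributeAggregation">
        <parameter key="attribute_name" value="Csum"/>
        <parameter key="aggregation_attributes" value="C1[0-9]*"/>
    </operator>
    <operator name="AttributeAggregation (2)" class="AttributeAggregation">
        <parameter key="attribute_name" value="Csum_special"/>
        <parameter key="aggregation_attributes" value="C1001[1-2]|C1000[3-4]|C10017"/>
    </operator>
    <!-- reuse the training weights instead of recomputing them -->
    <operator name="AttributeWeightsLoader" class="AttributeWeightsLoader">
        <parameter key="attribute_weights_file" value="attr01.wgt"/>
    </operator>
    <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
        <parameter key="weight_relation" value="top k"/>
        <parameter key="k" value="25"/>
    </operator>
    <!-- load and apply the model written by the training process -->
    <operator name="ModelLoader" class="ModelLoader">
        <parameter key="model_file" value="dmc2007-dt.mod"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="CostEvaluator" class="CostEvaluator">
        <parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
    </operator>
</operator>
Writing attr01.wgt and dmc2007-dt.mod to files in the training process is what makes such a second pass possible.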
My conclusion: perhaps some more tests with the "new" DecisionTree could be done, if time permits. Although it is faster and probably gives better results on the datasets it was tested on, it seems that large datasets (or at least
this dataset) cause problems for the DT with its current parameter settings.
W-REPTree seems to be the better choice in this case. Another suggestion: perhaps the Rapid-I team should consider making the 'old' DecisionTree of V4.2 available in V4.4 under a name like DecisionTree.4.2, at least for some time. For my taste, the rate and depth of changes in the operators is a little too high ...

Best regards
Wolfgang