"Some experience with the (new) decision trees"
wokon
New Altair Community Member
Hi,
I just wanted to share some experience I gained with the different decision tree implementations in RapidMiner.
As an exercise I wanted to test RapidMiner on a realistically large dataset from the Data Mining Cup 2007 (DMC2007, http://www.data-mining-cup.de/2007/Wettbewerb), which has 50,000 records for training and 50,000 for testing.
I first worked with RapidMiner 4.2 and got pretty good results with MetaCost and DecisionTree: a misclassification cost of -0.141 on the training set. (The more negative, the better: the best of the roughly 300 participants in the DMC2007 challenge achieved -0.1578, while a very naive (bad) model that always predicts the most frequent class N would obtain 0.000.)
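(For clarity: these cost figures come from averaging a cost matrix over the confusion matrix, as the CostEvaluator in the process below does. Here is a minimal Python sketch of that computation, outside RapidMiner; the confusion counts are made-up placeholders, and I orient the matrices with rows = predicted class and columns = true class, which is consistent with "always predict N" giving 0.000.)

import numpy as np

# DMC2007 cost matrix, as used by the CostEvaluator in the process below.
# Rows = predicted class (N, A, B), columns = true class (N, A, B).
# Negative entries are profits, so a more negative average cost is better.
cost = np.array([[0.0,  0.0,  0.0],   # predict N: costs nothing, gains nothing
                 [1.0, -3.0,  1.0],   # predict A: profit 3 if correct, cost 1 otherwise
                 [1.0,  1.0, -6.0]])  # predict B: profit 6 if correct, cost 1 otherwise

# Hypothetical confusion matrix (counts), same orientation as the cost matrix.
confusion = np.array([[37900, 4000, 1500],
                      [ 2000, 2500,  400],
                      [  800,  300,  600]])

# Average cost per example: element-wise product, summed, divided by N.
avg_cost = (confusion * cost).sum() / confusion.sum()
print("average misclassification cost: %.3f" % avg_cost)  # -0.152 for these invented counts

# A tree with a single leaf always predicts N: all counts move into the first
# row, whose costs are all zero -- hence the trivial cost of 0.000.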
Then I switched to RapidMiner 4.4, since Steffen convinced me that it was better because of many other bug fixes (and it is). When rerunning the same process (with the revised DT of V4.4), my model was now 50x faster than in V4.2, but also very "naive": the misclassification cost was 0.000 because it always predicted "N", the trivial choice; all trees had only one leaf.
I guessed that the new pre-pruning might be the problem and activated the parameter no_pre_pruning. This led to a heap space error after 6 min.
I assumed that maximal_depth=20 might be too complex and decreased this parameter to 10. Now it worked: it took only 1 min (compared to 12 min in V4.2) and produced a result. But the result was, at least for this dataset, quantitatively inferior to V4.2, with a misclassification cost of only -0.091.
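(As a side note for readers without RapidMiner: the trade-off between tree depth, memory, and fit is easy to reproduce with any decision tree learner, e.g. scikit-learn's DecisionTreeClassifier. This is only an analogy on synthetic data, not RapidMiner's DecisionTree operator:)

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the DMC2007 training set (50,000 records, 3 classes).
X, y = make_classification(n_samples=50000, n_features=25, n_informative=10,
                           n_classes=3, random_state=2001)

# Roughly analogous to no_pre_pruning with maximal_depth=20: many nodes, more memory.
deep = DecisionTreeClassifier(max_depth=20).fit(X, y)

# Roughly analogous to maximal_depth=10: far fewer nodes, but it may underfit.
shallow = DecisionTreeClassifier(max_depth=10).fit(X, y)

print("nodes at depth 20:", deep.tree_.node_count)
print("nodes at depth 10:", shallow.tree_.node_count)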
I then replaced the DecisionTree operator with WEKA's W-REPTree operator, with parameter L=12, and right from the start got results better than all others before: a misclassification cost of -0.147. And W-REPTree is incredibly fast (about 5 s) and does not consume much heap space.
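(If anyone wants to call WEKA's REPTree directly, the same configuration can be sketched with the python-weka-wrapper3 package; the ARFF file name is hypothetical, and -L is REPTree's maximum tree depth, i.e. the L=12 used above:)

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()

# Hypothetical ARFF export of the DMC2007 training set.
data = Loader(classname="weka.core.converters.ArffLoader").load_file("dmc2007_train.arff")
data.class_is_last()  # assumes the label (N/A/B) is the last attribute

# weka.classifiers.trees.REPTree; -L 12 caps the maximum tree depth at 12.
reptree = Classifier(classname="weka.classifiers.trees.REPTree", options=["-L", "12"])
reptree.build_classifier(data)
print(reptree)

jvm.stop()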
Since the datasets are a little too large to be uploaded here, I provide them (500 KB zip) at http://www.gm.fh-koeln.de/~konen/DMDATA/dmc2007_train.zip, in case anyone wants to reproduce the results.
My conclusion: perhaps some more tests with the "new" DecisionTree could be done, if time permits. Although it is faster and probably gives better results on the datasets it was tested on, it seems that large datasets (or just this one) cause problems for the DT with its current parameter settings.
Here is the process I use (you may disable W-REPTree and enable DecisionTree to reproduce the last DT experiment):
<operator name="Root" class="Process" expanded="yes">
<parameter key="random_seed" value="2001"/>
<operator name="MemoryCleanUp" class="MemoryCleanUp">
</operator>
<operator name="ExampleSource" class="ExampleSource">
<parameter key="attributes" value="dmc2007_train.aml"/>
</operator>
<operator name="AttributeAggregation" class="AttributeAggregation">
<parameter key="attribute_name" value="Csum"/>
<parameter key="aggregation_attributes" value="C1[0-9]*"/>
</operator>
<operator name="AttributeAggregation (2)" class="AttributeAggregation">
<parameter key="attribute_name" value="Csum_special"/>
<parameter key="aggregation_attributes" value="C1001[1-2]|C1000[3-4]|C10017"/>
</operator>
<operator name="InfoGainWeighting" class="InfoGainWeighting">
</operator>
<operator name="AttributeWeightsWriter" class="AttributeWeightsWriter">
<parameter key="attribute_weights_file" value="attr01.wgt"/>
</operator>
<operator name="AttributeWeightSelection" class="AttributeWeightSelection">
<parameter key="keep_attribute_weights" value="true"/>
<parameter key="weight" value="0.5"/>
<parameter key="weight_relation" value="top k"/>
<parameter key="k" value="25"/>
<parameter key="use_absolute_weights" value="false"/>
</operator>
<operator name="MetaCost" class="MetaCost" expanded="yes">
<parameter key="keep_example_set" value="true"/>
<parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
<parameter key="use_subset_for_training" value="0.1"/>
<operator name="W-REPTree" class="W-REPTree">
<parameter key="L" value="12.0"/>
</operator>
<operator name="DecisionTree" class="DecisionTree" activated="no">
<parameter key="maximal_depth" value="10"/>
<parameter key="no_pre_pruning" value="true"/>
</operator>
</operator>
<operator name="ModelWriter" class="ModelWriter">
<parameter key="model_file" value="dmc2007-dt.mod"/>
</operator>
<operator name="ModelApplier" class="ModelApplier">
<parameter key="keep_model" value="true"/>
<list key="application_parameters">
</list>
</operator>
<operator name="ClassificationPerformance" class="ClassificationPerformance">
<parameter key="keep_example_set" value="true"/>
<parameter key="classification_error" value="true"/>
<list key="class_weights">
<parameter key="N" value="1.0"/>
<parameter key="A" value="1.0"/>
<parameter key="B" value="1.0"/>
</list>
</operator>
<operator name="CostEvaluator" class="CostEvaluator">
<parameter key="keep_exampleSet" value="true"/>
<parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
</operator>
</operator>
The W-REPTree seems to be the better choice in this case. Another suggestion: perhaps the Rapid-I team should consider making the 'old' DecisionTree of V4.2 available in V4.4 under some name like DecisionTree.4.2, at least for some time. For my taste, the rate and depth of changes to the operators is a little too high ...
Best regards
Wolfgang
Answers
Hi Wolfgang,
thank you for sharing your experience and your experiment setup. I downloaded it yesterday but have only had a brief look so far. Indeed, the decision tree seems to perform worse than a Weka tree or other RapidMiner learners. In any case, the decision tree is already on my/our list for another revision. Presumably the pre-pruning is the reason for producing such poor trees.
wokon wrote:
The W-REPTree seems to be the better choice in this case. Another suggestion: perhaps the Rapid-I team should consider making the 'old' DecisionTree of V4.2 available in V4.4 under some name like DecisionTree.4.2, at least for some time. For my taste, the rate and depth of changes to the operators is a little too high ...

Well, I am not that fond of the idea of introducing multiple versions of operators in one version of RM; in my opinion that would presumably lead to confusion among most users. Nevertheless, we will of course try to approach the problem very soon and implement a decision tree version that is stable and performant. When I have a look at the tree, I will definitely consider your training example as well. So thanks again for sharing; we will keep you up to date on any progress.
Kind regards,
Tobias
Hi Wolfgang,
I would also like to thank you for the nice description of your experiences. Such reports really help us a lot, because without this information we are always a bit project-driven (which is at least better than being "UCI-driven").
Just two side notes:
wokon wrote:
For my taste, the rate and depth of changes to the operators is a little too high ...

I fully understand, and I would also like to prevent such changes if they occur too often, especially for core operators like the DT learner. On the other hand, the old implementation was ridiculously slow for both smaller and larger data sets, and we simply had to do something about it. And although I understand your point, I personally think that the agile and dynamic style of our development is actually something many users really like about RapidMiner. Just consider that people who are happy with everything do not post nearly as often in a public forum as people having problems. That is sad, but another topic.
wokon wrote:
The W-REPTree seems to be the better choice in this case.

Isn't it nice to have the ability to test all of those methods and choose the best?
Actually, the REPTree and the DecisionTree operator are like apples and bananas; a "fairer" comparison would be "DecisionTree" versus "W-J48". But if you have already found that REPTree is best in your case, that is of course great, and a nice motivation for Tobias to improve the DecisionTree so that it produces equally good results in the same (or, even better, shorter) time :P
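(For anyone who wants to run that "fairer" comparison outside RapidMiner, here is a sketch with python-weka-wrapper3 that cross-validates both trees on the same data; the ARFF file name is hypothetical, and plain 10-fold CV differs from the MetaCost setup above:)

import weka.core.jvm as jvm
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation

jvm.start()
data = Loader(classname="weka.core.converters.ArffLoader").load_file("dmc2007_train.arff")
data.class_is_last()

# Compare REPTree (used above) with J48, WEKA's C4.5 implementation.
for name in ("weka.classifiers.trees.REPTree", "weka.classifiers.trees.J48"):
    cls = Classifier(classname=name)
    evl = Evaluation(data)
    evl.crossvalidate_model(cls, data, 10, Random(1))
    print(name, "error rate: %.3f" % evl.error_rate)

jvm.stop()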
Thanks again for your nice comments and all the best,
Ingo
Hello Ingo and Tobias,
Ingo wrote:
Isn't it nice to have the ability to test all of those methods and choose the best?

It definitely is!! I would probably never have tested W-REPTree if it were not lying around in the RapidMiner toolbox. I think it is one of the strengths of RapidMiner that it puts so many concepts you otherwise would not have time to really work with right at your fingertips (or tooltips... )
Having said this, let me comment on Tobias' point:
Tobias wrote:
Well, I am not that fond of the idea of introducing multiple versions of operators in one version of RM.

Yes, I understand your point, but on the other hand there are so many, many operators in RM that another one or two (perhaps in a special 'Deprecated' folder) would not hurt too much, would they? But this is a matter of taste, and I understand your arguments as well.
Thanks again for the fast response; I think I now get an idea of what the "Rapid" in "Rapid-I" stands for ...
Best regards
Wolfgang