RM Decision Trees, AdaBoost

Legacy User
Legacy User New Altair Community Member
edited November 5 in Community Q&A
Random question on how decision trees work in RapidMiner. I'm running a decision tree for a predictive model, and at the moment I'm just splitting my dataset into 80% train / 20% test. It's a polynominal (multi-class) classification problem with numerical and nominal attributes. Two questions:

1) When I run a single decision tree with the % split validation operator, why does it run the decision tree training twice? Looking at the log, it runs once, and then while validation is still running I see a second [2] Decision Tree entry in the log.

2) When I use AdaBoost to boost the decision trees, the runtime and memory usage increase exponentially with each iteration, e.g. 30 minutes for the first, then 1 hour, then 2 hours, and so on. Obviously I can't run a model with this kind of resource usage, but why does this happen? I've tried boosting methods in other programs and have never run into exponentially increasing runtimes. Do I have a parameter set wrong?
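For reference, a minimal timing sketch (Python/scikit-learn with synthetic data, not RapidMiner) of how AdaBoost is normally expected to scale: each iteration fits one weak learner on the same-sized reweighted sample, so total runtime should grow roughly linearly with the number of iterations, not double each time.

```python
# Minimal sketch (scikit-learn, synthetic data): each AdaBoost iteration
# fits one weak learner (a shallow decision tree by default) on the same
# reweighted sample, so runtime should grow roughly linearly.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=5000, n_classes=3,
                           n_informative=6, random_state=0)

for n in (10, 20, 40):
    t0 = time.time()
    AdaBoostClassifier(n_estimators=n, random_state=0).fit(X, y)
    print(f"{n} iterations: {time.time() - t0:.2f}s")
```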

Thanks!

Answers

  • kovacs_balazs_k
    kovacs_balazs_k New Altair Community Member
    edited July 2020
Same issue here in 2020, with the Split Validation operator at compatibility level 9.7.001.
    I noticed this when I analyzed the execution times in the logs. I also checked whether it shows up in the process status bar, and there is indeed a modeling operator (Neural Net or SVM) with an index of [2]. So the training phase runs twice.
    Edit: I investigated the issue using breakpoints after the Neural Net operator. The first time, it uses only 70% of the examples to train the network, but the second time the training is executed on the entire dataset.

Edit 2: After investigating further, I think I figured out why the Split Validation operator behaves like this. Its main steps are:
1) It runs the training subprocess on the training data set, which is 70% of the entire sample by default, and stores the resulting model (let's call it model1) for later use in the testing subprocess. If the performance of this model is measured, it is stored as one of the later outputs of the Split Validation operator on one of the corresponding ave ports.
    2) It runs the training subprocess again, this time on the entire sample (100%), and sets the resulting model (let's call it model2) as the later output of the Split Validation operator on the mod output port.
    3) It runs the testing subprocess on the remaining portion of the sample (30% by default). The inner mod input port of the testing subprocess delivers model1 for testing purposes. If the performance of this model is measured, it is stored as one of the later outputs of the Split Validation operator on one of the corresponding ave ports.

So this behavior is intentional, but it would be better if a parameter let me turn off the training on the entire data set while I am searching for the best parameter combination; it could cut the search time in half. (A sketch of these steps follows below.)
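    To make these steps concrete, here is a minimal runnable sketch in plain Python; the names (split_validation, trainer, tester) are illustrative, not RapidMiner APIs.

    ```python
    # Sketch of the Split Validation behavior described above: the trainer
    # runs twice, once on the training split and once on the full sample.
    def split_validation(data, train_ratio, trainer, tester):
        n_train = int(len(data) * train_ratio)
        train_set, test_set = data[:n_train], data[n_train:]
        model1 = trainer(train_set)             # step 1: fit on the training split
        model2 = trainer(data)                  # step 2: a second fit on 100% of the data
        performance = tester(model1, test_set)  # step 3: evaluate model1 on the held-out split
        return model2, performance              # the mod and ave outputs

    # Toy usage: the trainer is called twice, once on 70% and once on 100%.
    calls = []
    trainer = lambda d: calls.append(len(d)) or len(d)  # "model" = training set size
    tester = lambda m, d: m / (m + len(d))              # dummy performance value
    model, perf = split_validation(list(range(10)), 0.7, trainer, tester)
    print(calls)  # [7, 10]
    ```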
  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi,

If the Model output (mod port) of the validation is not connected, it shouldn't run the model building twice.

    Regards,
    Balázs
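    Following up on Balázs's point, a hypothetical variant of the sketch above skips the second training run unless a final model is requested (in RapidMiner terms, unless the mod output port is connected):

    ```python
    # Hypothetical variant: train on the full dataset only when the final
    # model is actually requested (i.e. the mod port is connected).
    def split_validation(data, train_ratio, trainer, tester, deliver_model=False):
        n_train = int(len(data) * train_ratio)
        train_set, test_set = data[:n_train], data[n_train:]
        model1 = trainer(train_set)                        # fit on the training split
        model2 = trainer(data) if deliver_model else None  # full fit only on demand
        performance = tester(model1, test_set)             # evaluate the split model
        return model2, performance
    ```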
  • jacobcybulski
    jacobcybulski New Altair Community Member
If you think of Split Validation as a kind of Cross-Validation, it makes sense: first the model is trained on its training fold and performance statistics are collected, and then a model is built on all of the available data for its later deployment.
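    The same pattern shows up in other toolkits. As a minimal analogue (Python/scikit-learn, using the Iris dataset purely for illustration): performance is estimated on held-out folds, and the deployment model is then refit on all of the data.

    ```python
    # Minimal analogue: cross-validated performance estimate, followed by
    # a final refit on the full dataset for deployment.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    scores = cross_val_score(clf, X, y, cv=5)  # performance on held-out folds
    clf.fit(X, y)                              # deployment model uses all the data
    print(f"CV accuracy: {scores.mean():.3f}")
    ```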
  • Telcontar120
    Telcontar120 New Altair Community Member
    Why not simply use cross-validation and avoid all these pitfalls associated with split validation in the first place?