RM Decision Trees, AdaBoost

Legacy User
Legacy User New Altair Community Member
edited November 5 in Community Q&A
Random question on how decision trees work in RapidMiner. I'm running a decision tree for a predictive model, and at the moment I'm just splitting my dataset into 80% train / 20% test. It's a polynominal (multi-class) classification problem with numerical and nominal attributes. Two questions:

1) When I run a single decision tree with the % split validation operator, why does it run the decision tree training twice? Looking at the log, it runs once, and then while validation is still running I see a second [2] Decision Tree entry in the log.

2) When I use AdaBoost to boost the decision trees, the runtime and memory usage increase exponentially with each iteration, e.g. 30 minutes for the first, then 1 hour, then 2 hours, and so on. Obviously I can't run a model with this kind of resource usage, but why does this happen? I've tried boosting methods in other programs and have never run into exponentially increasing runtimes. Do I have a parameter set wrong?
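For reference, a minimal timing sketch (Python/scikit-learn with synthetic data, not RapidMiner) of how AdaBoost is normally expected to scale: each iteration fits one weak learner on the same-sized reweighted sample, so total runtime should grow roughly linearly with the number of iterations, not double each time.

```python
# Minimal sketch (scikit-learn, synthetic data): each AdaBoost iteration
# fits one weak learner (a shallow decision tree by default) on the same
# reweighted sample, so runtime should grow roughly linearly.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=5000, n_classes=3,
                           n_informative=6, random_state=0)

for n in (10, 20, 40):
    t0 = time.time()
    AdaBoostClassifier(n_estimators=n, random_state=0).fit(X, y)
    print(f"{n} iterations: {time.time() - t0:.2f}s")
```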

Thanks!

Answers

  • kovacs_balazs_k
    kovacs_balazs_k New Altair Community Member
    edited July 2020
Same issue here in 2020, with the Split Validation operator at compatibility level 9.7.001.
    I noticed this when I analyzed the execution times in the logs. I also checked whether it shows up in the process status bar, and there is indeed a modeling operator (Neural Net or SVM) with an index of [2]. So the training phase runs twice.
    Edit: I investigated the issue using breakpoints after the Neural Net operator. The first time, it uses only 70% of the examples to train the network, but the second time the training is executed on the entire dataset.

Edit 2: After investigating further, I think I figured out why the Split Validation operator behaves like this. Its main steps are:
1) It runs the training subprocess on the training data set, which is 70% of the entire sample by default, and stores the resulting model (let's call it model1) for later use in the testing subprocess. If the performance of this model is measured, it is stored as one of the later outputs of the Split Validation operator on one of the corresponding ave ports.
    2) It runs the training subprocess again, this time on the entire sample (100%), and sets the resulting model (let's call it model2) as the later output of the Split Validation operator on the mod output port.
    3) It runs the testing subprocess on the remaining portion of the sample (30% by default). The inner mod input port of the testing subprocess delivers model1 for testing purposes. If the performance of this model is measured, it is stored as one of the later outputs of the Split Validation operator on one of the corresponding ave ports.

So this behavior is intentional, but it would be better if a parameter let me turn off the training on the entire data set while I am searching for the best parameter combination; it could cut the search time in half. (A sketch of these steps follows below.)
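    To make these steps concrete, here is a minimal runnable sketch in plain Python; the names (split_validation, trainer, tester) are illustrative, not RapidMiner APIs.

    ```python
    # Sketch of the Split Validation behavior described above: the trainer
    # runs twice, once on the training split and once on the full sample.
    def split_validation(data, train_ratio, trainer, tester):
        n_train = int(len(data) * train_ratio)
        train_set, test_set = data[:n_train], data[n_train:]
        model1 = trainer(train_set)             # step 1: fit on the training split
        model2 = trainer(data)                  # step 2: a second fit on 100% of the data
        performance = tester(model1, test_set)  # step 3: evaluate model1 on the held-out split
        return model2, performance              # the mod and ave outputs

    # Toy usage: the trainer is called twice, once on 70% and once on 100%.
    calls = []
    trainer = lambda d: calls.append(len(d)) or len(d)  # "model" = training set size
    tester = lambda m, d: m / (m + len(d))              # dummy performance value
    model, perf = split_validation(list(range(10)), 0.7, trainer, tester)
    print(calls)  # [7, 10]
    ```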
  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi,

If the Model output (mod port) of the validation is not connected, it shouldn't run the model building twice.

    Regards,
    Balázs
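    Following up on Balázs's point, a hypothetical variant of the sketch above skips the second training run unless a final model is requested (in RapidMiner terms, unless the mod output port is connected):

    ```python
    # Hypothetical variant: train on the full dataset only when the final
    # model is actually requested (i.e. the mod port is connected).
    def split_validation(data, train_ratio, trainer, tester, deliver_model=False):
        n_train = int(len(data) * train_ratio)
        train_set, test_set = data[:n_train], data[n_train:]
        model1 = trainer(train_set)                        # fit on the training split
        model2 = trainer(data) if deliver_model else None  # full fit only on demand
        performance = tester(model1, test_set)             # evaluate the split model
        return model2, performance
    ```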
  • jacobcybulski
    jacobcybulski New Altair Community Member
If you think of Split Validation as a kind of Cross-Validation, it makes sense: first the model is trained on its training fold and performance statistics are collected, and then a model is built on all of the available data for its later deployment.
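    The same pattern shows up in other toolkits. As a minimal analogue (Python/scikit-learn, using the Iris dataset purely for illustration): performance is estimated on held-out folds, and the deployment model is then refit on all of the data.

    ```python
    # Minimal analogue: cross-validated performance estimate, followed by
    # a final refit on the full dataset for deployment.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    scores = cross_val_score(clf, X, y, cv=5)  # performance on held-out folds
    clf.fit(X, y)                              # deployment model uses all the data
    print(f"CV accuracy: {scores.mean():.3f}")
    ```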
  • Telcontar120
    Telcontar120 New Altair Community Member
    Why not simply use cross-validation and avoid all these pitfalls associated with split validation in the first place?