"Regression Trees in RapidMiner 5 Community Edition"

christian1983 New Altair Community Member
Hello to everybody,

I am working on my master thesis, which deals with data mining aspects, so I have started learning RapidMiner 5.0.
There are a lot of problems I'm facing, so I hope to get help in this forum.
My problem is that I want to use decision trees to predict quantitative values, so I need trees that can handle numerical labels, i.e. regression trees.
Although RapidMiner 5.0 provides lots of different types of decision trees that are described as regression trees, they cannot handle numerical labels, so I'm a little bit confused about that.

Here is an excerpt of the data to be analyzed:
input 1   input 2   input 3   input 4   input 5   input 6   label
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,120
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,121
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,127
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,099
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,094
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,127
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,097
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,101
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,087
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,038
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,058
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,038
0,0197    0,3550    0,8407    0,03      0,14      0,056     0,100
0,0197    0,3550    0,8407    0,03      0,14      0,056     0,096

Sorry for the bad layout.
The description of the decision trees I intend to use says the following:
"This operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which often can also easily be understood. This decision tree learner works similar to Quinlan's C4.5 or CART.
The actual type of the tree is determined by the criterion, e.g. using gain_ratio or Gini for CART / C4.5."
This decision tree works similarly to CART (Classification And Regression Trees), but it cannot handle a numerical label.

I hope you can help me.

Thank you.



Answers

  • haddock
    haddock New Altair Community Member
    Hi there Christian,

    Interesting! Just to make sure we are all singing from the same song book, here's some stuff from Wikipedia on Predictive analytics..

    Trees are formed by a collection of rules based on values of certain variables in the modeling data set

        * Rules are selected based on how well splits based on variables’ values can differentiate observations based on the dependent variable
        * Once a rule is selected and splits a node into two, the same logic is applied to each “child” node (i.e. it is a recursive procedure)
        * Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met

    Each branch of the tree ends in a terminal node

        * Each observation falls into one and exactly one terminal node
        * Each terminal node is uniquely defined by a set of rules

    So anything that can generate a rule about a number (<, =, >) could be your learner; it could even be a group of learners. What matters is the testing arrangement which applies the rule and checks the result. You will see that RM has a sensible array of operators to do this: bin makers, learners, validators, and genetic parameter optimisers. So you could build a template layout where all you have to do is add the learners to test as parameters, but...

    The 'but' is about overtraining. At what stage do you decide that enough is enough? How do you decide that? Just how much data remains unseen, and what data, and why? It shouldn't be too difficult to construct a general purpose testing rig, which would expose the underlying issue... When, exactly, is a pattern really a pattern?
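    For what it's worth, here is a minimal sketch of that binning idea outside RapidMiner, written in Python with scikit-learn purely for illustration. The data, the number of bins and the tree depth are all invented; the point is only the shape of the workflow (bin the label, learn a classifier, check it on held-out rows):

        # Illustration only: turn a numeric label into bins, then use an
        # ordinary classification tree on the bins and score it on unseen rows.
        import numpy as np
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        X = pd.DataFrame(rng.uniform(0, 1, size=(200, 6)),
                         columns=[f"input{i}" for i in range(1, 7)])
        y = 0.1 * X["input1"] + rng.normal(0, 0.01, size=200)  # continuous label

        # five ordered classes instead of a continuous label (bin count is arbitrary)
        y_binned = pd.cut(y, bins=5, labels=False)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y_binned, test_size=0.3, random_state=0)

        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        clf.fit(X_train, y_train)
        print("accuracy on held-out rows:", clf.score(X_test, y_test))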

    You should have a lot of fun with this, hope so!


  • land
    land New Altair Community Member
    Hi,
    RapidMiner does not support a RegressionTree itself. You could use the one from Weka, assuming Weka has one. Or you could write a RegressionTree yourself and contribute it, which is what I would prefer, as you might imagine :)

    Greetings,
      Sebastian
  • christian1983
    christian1983 New Altair Community Member
    Hi,

    First of all, thank you for your quick reply.
    Maybe I did not describe my problem well, but my question refers to the fact that the decision trees provided in RM 5.0 are not able to handle a numerical label, although they belong to the group Modeling.ClassificationandRegression.Tree.
    Actually, they would have to deal with numerical labels in order to make predictions based on the regression tree algorithm CART.

    I hope, my problem is clear now.

    Thank you.
  • cherokee
    cherokee New Altair Community Member
    Hi christian,

    I'm afraid your problem is clear. But so is the answer: RapidMiner has NO learner for regression trees. All tree learners in RapidMiner are classification trees. Nevertheless, these trees can handle numeric attributes, but not a numeric label.

    By the way, the group Modeling.ClassificationandRegression contains all learners which are suitable for classification and/or regression.

    Concerning the CART algorithm you are right, it also computes regression trees (as far as I know). But as your quote states:
    "This operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which often can also easily be understood. This decision tree learner works similar to Quinlan's C4.5 or CART."
    If your master thesis is not directly about regression trees, you could use some of RapidMiner's regression learners.

    Best regards and good luck with your thesis,
    chero
  • haddock
    haddock New Altair Community Member
    Hi,

    Chero (greets Chero) is right in my view. The trick here is that you can use classifiers to predict numeric values, but not if you treat them as a continuous range. Effectively you break the range into steps which cover the spectrum, so you are approximating that range. Perhaps the point of my post emerges?
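    To put a number on that approximation, here is another small Python/scikit-learn sketch (again outside RapidMiner, with made-up data): the classifier predicts a bin, and the midpoint of that bin is reported as the approximate numeric value.

        # Sketch of "steps approximate the range": predict a bin, report its midpoint.
        import numpy as np
        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)
        X = pd.DataFrame(rng.uniform(0, 1, size=(200, 6)),
                         columns=[f"input{i}" for i in range(1, 7)])
        y = 0.1 * X["input1"] + rng.normal(0, 0.01, size=200)

        # bin the label and keep the bin edges so predictions can be mapped back
        y_binned, edges = pd.cut(y, bins=5, labels=False, retbins=True)
        midpoints = (edges[:-1] + edges[1:]) / 2

        clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_binned)
        y_approx = midpoints[clf.predict(X)]
        print("mean absolute error of the step approximation:",
              np.abs(y_approx - y).mean())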

    You don't have to take my word for it...

    http://www.dtreg.com/classregress.htm
    http://www.resample.com/xlminer/help/rtree/rtree_intro.htm
    http://www.cscu.cornell.edu/news/statnews/stnews62.pdf

    and 1000's more ...

    To summarise, no classification algorithm handles continuous labels, but that does not mean that the relationship between continuous variables cannot be investigated by classification algorithms.



  • land
    land New Altair Community Member
    Hi,
    and yes, someday someone will add the missing "AR" to the CART. In fact we once had a regression tree, but it died at a young age and I had to bury one of my first creations with RapidMiner... The only sign left of this little class is the group name...

    Greetings,
      Sebastian
  • B_Miner
    B_Miner New Altair Community Member
    It is very interesting that there is no regression tree; I too assumed from the names that there was one, although I never had to use it.

    Christian: if you need an open-source implementation, check out rpart in R. Otherwise, commercial products such as Clementine have a true regression tree (as part of CART or CHAID).
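    For comparison, here is roughly what such a true regression tree does, sketched in Python with scikit-learn on made-up data shaped like the excerpt above (six numeric inputs, one numeric label). This is only an illustration of the idea, not a RapidMiner operator:

        # A genuine regression tree: the label stays numeric, each leaf predicts
        # the mean label of its training rows. Data and parameters are invented.
        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(2)
        X = rng.uniform(0, 1, size=(200, 6))
        y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, size=200)

        reg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=0)
        scores = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_absolute_error")
        print("cross-validated MAE:", -scores.mean())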
  • earmijo
    earmijo New Altair Community Member
    I have a similar wish: an implementation of CART in RapidMiner. Until that day arrives, I have used W-REPTree to estimate regression trees. It is not exactly the same as CART, but it does the trick. An interesting variation is W-M5P. Here you have the possibility of estimating a tree with linear regressions as leaves. Check it out.
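    The leaf-regression idea behind W-M5P can be sketched as well. This is not Weka's M5P algorithm, only a rough Python/scikit-learn illustration of its structure on made-up data: a shallow tree splits the data, and a separate linear regression is fitted inside each leaf.

        # Model-tree sketch: shallow regression tree + one linear model per leaf.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(3)
        X = rng.uniform(0, 1, size=(300, 6))
        y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, size=300)

        tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30,
                                     random_state=0).fit(X, y)
        leaf_of_row = tree.apply(X)          # leaf index for every training row
        leaf_models = {leaf: LinearRegression().fit(X[leaf_of_row == leaf],
                                                    y[leaf_of_row == leaf])
                       for leaf in np.unique(leaf_of_row)}

        def predict_model_tree(X_new):
            leaves = tree.apply(X_new)
            return np.array([leaf_models[leaf].predict(row.reshape(1, -1))[0]
                             for leaf, row in zip(leaves, X_new)])

        print("training MAE:", np.abs(predict_model_tree(X) - y).mean())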
  • land
    land New Altair Community Member
    Hi,
    by the way: I'm currently working on an R integration for RapidMiner. With this extension, another regression tree would be available from within RapidMiner.

    And a last note on commercial products: before buying a commercial product like Clementine because one or two operators are missing, I would suggest contacting us. I'm quite sure we can include those operators for less than half the money Clementine would cost you...

    Greetings,
      Sebastian