"Regression Trees in RapidMiner 5 Community Edition"
christian1983
New Altair Community Member
Hello to everybody,
I am working on my master's thesis, which deals with data mining, so I have started learning RapidMiner 5.0.
There are a lot of problems I'm facing, so I hope to get some help in this forum.
My problem is that I need decision trees to predict quantitative values, so I need trees that can handle numerical labels, i.e. regression trees.
Although RapidMiner 5.0 provides several types of decision trees that are described as regression trees, they cannot handle numerical labels, so I'm a little confused about that.
Here is an excerpt of the data to be analyzed:
input 1   input 2   input 3   input 4   input 5   input 6   label
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,120
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,121
0,0050    0,0413    0,0610    0,01      0,01      0,01      0,127
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,099
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,094
0,0037    0,0467    0,0913    0,01      0,01      0,01      0,127
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,097
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,101
0,0030    0,0363    0,0600    0,01      0,01      0,01      0,087
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,038
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,058
0,0030    0,0370    0,0593    0,01      0,01      0,01      0,038
0,0197    0,3550    0,8407    0,03      0,14      0,056     0,100
0,0197    0,3550    0,8407    0,03      0,14      0,056     0,096
Sorry for the bad layout.
The description of the decision trees I intend to use says the following:
"This operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which often can also easily be understood. This decision tree learner works similar to Quinlan's C4.5 or CART.
The actual type of the tree is determined by the criterion, e.g. using gain_ratio or Gini for CART / C4.5."
So this decision tree works similarly to CART (Classification And Regression Trees), but it cannot handle a numerical label.
I hope you can help me.
Thank you.
Answers
-
Hi there Christian,
Interesting! Just to make sure we are all singing from the same song book, here's some material from Wikipedia on predictive analytics.
So anything that can generate a rule about a number (<, =, >) could be your learner; it could even be a group of learners. What matters is the testing arrangement that applies the rule and checks the result. You will see that RM has a sensible array of operators for this: bin makers, learners, validators, and genetic parameter optimisers. So you could build a template layout where all you have to do is add the learners to test as parameters, but... (a rough sketch of the splitting idea follows the quoted list below)
Trees are formed by a collection of rules based on values of certain variables in the modeling data set
* Rules are selected based on how well splits based on variables’ values can differentiate observations based on the dependent variable
* Once a rule is selected and splits a node into two, the same logic is applied to each “child” node (i.e. it is a recursive procedure)
* Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met
Each branch of the tree ends in a terminal node
* Each observation falls into one and exactly one terminal node
* Each terminal node is uniquely defined by a set of rules
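To make that splitting procedure concrete, here is a minimal sketch in plain Python (nothing to do with RapidMiner's own operators, and deliberately naive): at each node, pick the split that most reduces the squared error of the numeric label, which is essentially what the regression half of CART does.

import numpy as np

def best_split(X, y):
    """Find the (feature, threshold) whose split gives the lowest summed
    squared error in the two child nodes; return None if no split is possible."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:        # candidate thresholds per feature
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[2]:
                best = (j, t, sse)
    return best

def grow(X, y, depth=0, max_depth=3, min_size=5):
    """Recursively split until a stopping rule is met; every terminal node
    predicts the mean label of the observations that fall into it."""
    if depth >= max_depth or len(y) < min_size or y.std() == 0:
        return {"predict": y.mean()}
    split = best_split(X, y)
    if split is None:                            # no further gain can be made
        return {"predict": y.mean()}
    j, t, _ = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left":  grow(X[mask],  y[mask],  depth + 1, max_depth, min_size),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_size)}

def predict(node, x):
    """Each observation falls into exactly one terminal node."""
    while "predict" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["predict"]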
The 'but' is about overtraining. At what stage do you decide that enough is enough? How do you decide that? Just how much data remains unseen, and what data, and why? It shouldn't be too difficult to construct a general purpose testing rig, which would expose the underlying issue... When, exactly, is a pattern really a pattern?
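And to make the overtraining question concrete, a quick hypothetical experiment: compare the error on seen and unseen data as the tree is allowed to grow deeper. This uses scikit-learn's DecisionTreeRegressor and synthetic data purely for brevity; in RapidMiner the validation operators play the same role.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.random((300, 6))                                   # six synthetic inputs
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, 300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

for depth in (2, 4, 8, 16):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    print(f"depth {depth:2d}:"
          f"  train MSE {mean_squared_error(y_train, tree.predict(X_train)):.5f}"
          f"  test MSE {mean_squared_error(y_test, tree.predict(X_test)):.5f}")

Beyond some depth the training error keeps falling while the test error stops improving, which is exactly the "when is a pattern really a pattern" question.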
You should have a lot of fun with this, hope so!
-
Hi,
RapidMiner itself does not include a regression tree. You could use one from Weka, assuming Weka has one. Or you could write a regression tree yourself and contribute it, which is what I would prefer, as you might imagine.
Greetings,
Sebastian
-
Hi,
First of all, thank you for your quick reply.
Maybe I did not describe my problem well, but my question refers to the fact that the decision trees provided in RM 5.0 are not able to handle a numerical label, although they belong to the group Modeling.ClassificationandRegression.Tree.
They would actually have to deal with numerical labels in order to make predictions based on the CART regression tree algorithm.
I hope, my problem is clear now.
Thank you.
-
Hi christian,
I'm afraid your problem is clear. But so is the answer: RapidMiner has NO learner for regression trees. All tree learners in RapidMiner are classification trees. They can handle numeric attributes, but not a numeric label.
By the way, the group Modeling.ClassificationandRegression contains all learners which are suitable for classification and/or regression.
Concerning the CART algorithm you are right, this also computes regression trees (as far as I know). But as your quote states:
"This operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which often can also easily be understood. This decision tree learner works similar to Quinlan's C4.5 or CART."
If your master thesis is not directly about regression trees, you could use some of RapidMiner's regression learners.
Best regards and good luck with your thesis,
chero
-
Hi,
Chero (greets, Chero) is right in my view. The trick here is that you can use classifiers to predict numeric values, but not if you treat them as a continuous range. Effectively you break the range into steps that cover the spectrum, so you are approximating that range. Perhaps the point of my earlier post emerges?
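Purely as an illustration of that binning idea (Python with scikit-learn, which is of course not part of RapidMiner, and synthetic data): discretize the continuous label into a handful of classes, train an ordinary classification tree on those classes, and map each predicted class back to a representative value.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 6))                                   # six numeric inputs (synthetic)
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, 200)   # continuous label

# Break the continuous label into a fixed number of steps (bins).
n_bins = 5
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
y_binned = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)

# An ordinary classification tree can now learn the binned label.
clf = DecisionTreeClassifier(max_depth=4).fit(X, y_binned)

# Map each predicted bin back to its midpoint to get an approximate numeric value.
midpoints = np.array([(edges[i] + edges[i + 1]) / 2 for i in range(n_bins)])
y_approx = midpoints[clf.predict(X)]
print("mean absolute error of the binned approximation:", np.abs(y_approx - y).mean())

The more bins you use, the finer the approximation of the range, at the cost of fewer examples per class.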
You don't have to take my word for it...
http://www.dtreg.com/classregress.htm
http://www.resample.com/xlminer/help/rtree/rtree_intro.htm
http://www.cscu.cornell.edu/news/statnews/stnews62.pdf
and thousands more ...
To summarise, no classification algorithms handle continuous labels, but that does not mean to say that the relationship between continuous variables cannot be investigated by classification algorithms.
-
Hi,
and yes, someday someone will add the missing "AR" to the CART. In fact we once had a regression tree, but it died at a young age and I had to bury one of my first creations with RapidMiner... The only sign left of this little class is this group name...
Greetings,
Sebastian
-
It is very interesting that there is no regression tree - I too assumed there was one from the names, although I never had to use one.
Christian - if you need an open source implementation, check out rpart in R. Otherwise, commercial products such as Clementine have a true regression tree (as part of CART or CHAID).
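Not rpart itself, but if an open-source regression tree outside RapidMiner is acceptable, scikit-learn's DecisionTreeRegressor in Python does the same job. A small sketch, fitted to a few rows from the excerpt in the original post (decimal commas written as dots):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[0.0050, 0.0413, 0.0610, 0.01, 0.01, 0.01],
              [0.0037, 0.0467, 0.0913, 0.01, 0.01, 0.01],
              [0.0030, 0.0363, 0.0600, 0.01, 0.01, 0.01],
              [0.0030, 0.0370, 0.0593, 0.01, 0.01, 0.01],
              [0.0197, 0.3550, 0.8407, 0.03, 0.14, 0.056]])
y = np.array([0.120, 0.099, 0.097, 0.038, 0.100])          # the numeric label

reg = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(reg, feature_names=[f"input{i + 1}" for i in range(6)]))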
-
I have a similar wish: an implementation of CART in RapidMiner. Until that day arrives, I have used W-REPTree to estimate regression trees. It is not exactly the same as CART but it does the trick. An interesting variation is W-M5P. Here you have the possibility of estimating a tree with linear regressions as leaves. Check it out (a rough sketch of that idea follows below).
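For what it's worth, the idea behind W-M5P can be sketched outside Weka as well: partition the data with a shallow tree, then fit a linear regression inside each leaf. The snippet below (Python/scikit-learn, synthetic data) only illustrates the concept and is not Weka's actual M5P algorithm.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.random((300, 6))                                   # synthetic inputs
y = 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.01, 300)

# Step 1: a shallow tree partitions the data into leaves.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20).fit(X, y)
leaf_of = tree.apply(X)                                    # leaf index of each row

# Step 2: fit one linear regression per leaf.
models = {leaf: LinearRegression().fit(X[leaf_of == leaf], y[leaf_of == leaf])
          for leaf in np.unique(leaf_of)}

def predict(X_new):
    """Route each row to its leaf and apply that leaf's linear model."""
    leaves = tree.apply(X_new)
    return np.array([models[leaf].predict(row.reshape(1, -1))[0]
                     for leaf, row in zip(leaves, X_new)])

print("model-tree MSE:", np.mean((predict(X) - y) ** 2))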
-
Hi,
by the way: I'm currently working on R integration for RapidMiner. With this extension, another Regression Tree would be available from within RapidMiner.
And a last note on commercial products: before buying a commercial product like Clementine because one or two operators are missing, I would suggest contacting us. I'm quite sure we can include those operators for less than half the money Clementine would cost you...
Greetings,
Sebastian