Gain Ratio vs. Gini
ollestrat
New Altair Community Member
Hello all,
In the algorithm description "Weight by Tree Importance" I found a remark concerning the split criterion of trees:
This algorithm is implemented following the idea from "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data" by Menze, Bjoern H. et al. (2009). It has been extended by additional criteria for computing the benefit created by a certain split. The original paper only mentioned the Gini Index; this operator additionally supports the more reliable criteria Information Gain and Information Gain Ratio.
Is there any theoretical (or maybe empirical) background on why Gain Ratio could be superior to the Gini Index? This would be very interesting, because I have indeed experienced better performance with Gain Ratio compared to a setup with the Gini Index, but I have no idea why, or whether the difference is statistically significant. Maybe whoever wrote this algorithm description/wiki entry can point me to a reference paper or the basic theory behind this assumption?
Ole
Answers
Hello ollestrat
This is indeed a very interesting question. Here are my two cents:
information gain vs gini index
Given how both values are calculated (see e.g. here), the difference should be unimportant. This paper indeed states in its conclusions (I have not read the whole text) that both measures disagree only 2% of the time.
information gain ratio vs gini index
Information Gain Ratio (for the formula see e.g. http://en.wikipedia.org/wiki/Information_gain_ratio) adds another factor to penalize attributes with too many different values (in the case of discrete attributes). Too many values mean that the base of every entropy calculation is rather small (on average). Who do you trust more? 1/2 or 5/10?
So the resulting interesting question is whether a Gini index ratio would perform as well as Information Gain Ratio. I am pretty sure that a paper about that has already been written and forgotten.
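To make the difference concrete, here is a small sketch in plain Python (a toy example of my own, not RapidMiner code) that computes Information Gain, Gain Ratio, Gini gain, and the hypothetical "Gini ratio" I mentioned, for a binary attribute versus an ID-like attribute with one value per example:

    # Toy sketch of the split criteria discussed above (my own function names).
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_scores(values, labels):
        """values: attribute value per example, labels: class per example."""
        n = len(labels)
        groups = {}
        for v, y in zip(values, labels):
            groups.setdefault(v, []).append(y)
        parts = list(groups.values())
        weights = [len(g) / n for g in parts]
        info_gain = entropy(labels) - sum(w * entropy(g) for w, g in zip(weights, parts))
        gini_gain = gini(labels) - sum(w * gini(g) for w, g in zip(weights, parts))
        # split_info is the penalty term: it grows with the number of attribute values.
        split_info = -sum(w * log2(w) for w in weights)
        return {
            "information gain": info_gain,
            "gain ratio": info_gain / split_info if split_info > 0 else 0.0,
            "gini gain": gini_gain,
            "hypothetical gini ratio": gini_gain / split_info if split_info > 0 else 0.0,
        }

    labels = ["yes", "yes", "no", "no", "yes", "no"]
    # A two-valued attribute vs. an ID-like attribute with a unique value per example:
    print(split_scores(["a", "a", "a", "b", "b", "b"], labels))
    print(split_scores([1, 2, 3, 4, 5, 6], labels))  # split_info shrinks the ID attribute's advantage

The ID-like attribute gets a perfect Information Gain, but its large split_info cuts the ratio down, which is exactly the penalty described above.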
regards,
steffen
I was going to post a question about this, but I figured this thread was very similar and both of you may have a good answer.
You talk about gain ratio. I'm interested in somehow measuring information compression. Have you found a way to do this in RM?
Hello rakirk
"information compression" ? InformationGain (Ratio) is such a measure.
See for example: http://www.autonlab.org/tutorials/infogain11.pdf
If this is not what you are looking for, please be more specific.
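To illustrate the "compression" reading of Information Gain with a toy example of my own: encoding the class labels costs the label entropy in bits per example on average, and once the attribute value is known it only costs the conditional entropy. The difference is exactly the Information Gain.

    # Toy illustration (my own numbers): information gain as bits saved per example.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    labels = ["spam", "spam", "ham", "ham", "ham", "spam"]
    attribute = ["long", "long", "short", "short", "long", "long"]

    groups = {}
    for v, y in zip(attribute, labels):
        groups.setdefault(v, []).append(y)
    cond_entropy = sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    bits_saved = entropy(labels) - cond_entropy  # this difference is the information gain
    print(f"{bits_saved:.3f} bits saved per example")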
regards,
steffen
Hi Steffen,
Thanks, and yes, you are right that IG is a compression measure and that it is useful for classifying.
I guess I need to be more specific: I am interested in calculating the level of compression of a classifier on data (i.e. even though information gain as a learning approach would lead to higher compression, I'd like to find a way to quantify the difference between, say, NBN and IG on a model in terms of compression). Maybe this isn't done a lot, but I have been reading papers about how compression is a good measure of real-world efficiency when compared to accuracy. Is there a way to do this with the Performance operator somehow?
Hi rakirk
What is NBN?
Additionally, I do not quite understand how you define "compression". Here are some related concepts: In the sense of the second concept, one model is better than another if its complexity is smaller (so that the danger of overfitting is reduced) while the performance is comparable. I have always used this concept as a rule of thumb, but never actually calculated it. It has never played a role for me, but this could be a sign that I am mining data in the wrong parts of the world or that I am simply doing it wrong.
Regarding your second question: as far as I know, this is not possible within RapidMiner. You can imagine that calculating such a measure is not trivial (operators may implement an interface where they specify their complexity, but what do you do when different optimization techniques come into play?).
hope this was helpful,
steffen
Hey,
In WEKA there is something like Serialized_Train_Set_Size and Serialized_Model_Size; it might be possible to use these as a description length measure.
The latest version of the WEKA C4.5 decision tree uses MDL to select the best tree, which trades off model complexity and accuracy.
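To illustrate the general two-part idea (this is only a sketch with made-up numbers, not what WEKA actually does internally): total description length = bits to encode the model + bits to encode the training labels given the model's predicted class probabilities. A smaller total favours simpler trees whose accuracy is still comparable.

    # Sketch of a two-part MDL score (toy numbers, not WEKA's implementation).
    from math import log2

    def data_bits(true_labels, predicted_probs):
        """Bits needed to encode the labels with the model's probabilities (log loss in bits)."""
        return sum(-log2(max(p[y], 1e-12)) for y, p in zip(true_labels, predicted_probs))

    def model_bits(num_nodes, bits_per_node=16.0):
        """Crude stand-in for model complexity: a fixed cost per tree node (my assumption)."""
        return num_nodes * bits_per_node

    def mdl_score(true_labels, predicted_probs, num_nodes):
        return model_bits(num_nodes) + data_bits(true_labels, predicted_probs)

    # Toy comparison: a small, less confident tree vs. a large tree that fits almost perfectly.
    y = ["a", "a", "b", "a", "b", "b"]
    small_tree_probs = [{"a": 0.8, "b": 0.2}] * 3 + [{"a": 0.3, "b": 0.7}] * 3
    large_tree_probs = [{"a": 0.99, "b": 0.01} if c == "a" else {"a": 0.01, "b": 0.99} for c in y]
    print("small tree MDL:", mdl_score(y, small_tree_probs, num_nodes=3))
    print("large tree MDL:", mdl_score(y, large_tree_probs, num_nodes=25))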
Best regards,
Wessel
Hi Wessel, thank you for the suggestion. I actually plan on using this for a decision tree problem I have been having; I will have to let you know how well it works.
@steffen - sorry for the confusion in my previous question. I have found the answer to the question I had in my head, though I did not express it well in words. I was looking for the technique of measuring compression: KL divergence.
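For anyone who finds this thread later, a rough sketch of what I mean (toy numbers of my own): KL(P || Q) between the empirical label distribution P and a model's predicted distribution Q is the extra bits per example you pay for coding with Q instead of the true P.

    # Toy sketch of KL divergence between two class distributions (my own numbers).
    from math import log2

    def kl_divergence(p, q):
        """p, q: dicts mapping class -> probability; assumes q > 0 wherever p > 0."""
        return sum(p[c] * log2(p[c] / q[c]) for c in p if p[c] > 0)

    empirical = {"yes": 0.6, "no": 0.4}
    model = {"yes": 0.8, "no": 0.2}
    print(f"KL(P || Q) = {kl_divergence(empirical, model):.3f} bits per example")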
regards,
rk