Calculating Variance and Entropy Explained
I've been asked a few times about the calculations for Decision Trees, which you see scattered across different areas in Knowledge Studio, including Information Value, Entropy Explained, and Variance Explained.
Depending on whether you have a Numeric or Categorical Dependent Variable (DV) you will see these measures showing up in places such as the Data Node "Segment Viewer", and the Variable Selection, Measures of Predictive Power and Decision Tree nodes.
I'll walk through the calculations for each of these using Decision Trees and Strategy Trees to show my working in case you want to follow along.
Numeric DV – Variance Explained
When you look at the accuracy of a Decision Tree with a numeric DV you'll see something like this (there's some rounding error in the one displayed in the tree):
Note: you can get this by running Validation or by selecting 'Resubstitution' from the Tools menu:
The Help gives the formulae:
Essentially "Variance Explained" and "Ratio Variance" are the same, which is calculated by the ratio of variance at the root node and the variance from across the leaf nodes.
The Input Variance is the Sum of Squared Errors (SSE) for the parent node.
The Output Variance is the SSE across all the leaf nodes.
We then divide the Output Variance by the Input Variance and subtract that ratio from 1. Easy, right?
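Written out (my paraphrase of those Help formulae rather than the exact notation):

Variance Explained = Ratio Variance = 1 - (Output Variance / Input Variance)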
Let's work through an example of a one-split tree.
We have a tree with 20 records, with the following values for our DV age: 8 rows of 25, 3 rows of 24, 6 of 23, and 3 of 22.
We're going to split it into three bins by 'Occupation'. The diagram below shows the 'Tree' along with the averages at each node; the 'Node Data' for the selected root node is shown on the bottom left; and the 'Split Report' shows the Input, Output and Ratio Variance (with Info being Ratio Variance * 100).
The Mean at the root is: 23.80 (you can calculate this yourself, or just trust the 'Average' shown on the node)
The child nodes have the data split in the following way (hopefully I typed this correctly):
node 1: 22, 22
node 2: 25, 25, 24, 24, 24, 23, 23, 23, 23, 23, 23
node 3: 25, 25, 25, 25, 25, 25, 22
Because this is a one-split tree, the Ratio Variance and Info in the "Split Report" tab give the same result as the Variance Explained.
Let's calculate the Sum of Squared Errors. We subtract the node mean/average from each value and square the result, e.g. for the first record that's (25-23.8)^2.
Squaring does a couple of things: it penalizes larger distances from the mean, and it makes everything positive (negative * negative = positive).
So our calculation becomes:
Input Variance = SSE = 8*(25-23.8)^2 + 3*(24-23.8)^2 + 6*(23-23.8)^2 + 3*(22-23.8)^2 = 25.2
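If you'd rather let Python do the arithmetic, here's a minimal sketch of the same calculation (the values are just the 20 ages from the example, nothing Knowledge Studio specific):

```python
# The 20 age values from the example above
ages = [25]*8 + [24]*3 + [23]*6 + [22]*3

mean = sum(ages) / len(ages)                # 23.8
sse = sum((a - mean) ** 2 for a in ages)    # Sum of Squared Errors around the node mean

print(round(mean, 2), round(sse, 4))        # 23.8 25.2 -> the Input Variance
```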
And then we'll do the same for each of the child nodes, trusting the node calculations:
Mean at node 1= 22,
Mean at node 2= 23.64,
Mean at node 3= 24.57
The child node SSEs are calculated in the same way as for the root node above, and together they give the Output Variance.
I was lazy and used a strategy tree to show this for those following at home.
Which gives the following:
Variance Explained is 1 minus the ratio of the sum of the child-node SSEs to the parent-node SSE, so:
1 - (Output Variance / Input Variance)
Putting in the values that we calculated:
Ratio Variance = 1 - ((0 + 6.5455 + 7.7143) / 25.2) = ~0.43414
Success!
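If you want to double-check the whole Variance Explained calculation in code, here's a quick sketch with the node values copied straight from the tree above:

```python
def sse(values):
    """Sum of squared differences from the node's own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

root   = [25]*8 + [24]*3 + [23]*6 + [22]*3
node_1 = [22, 22]
node_2 = [25, 25, 24, 24, 24, 23, 23, 23, 23, 23, 23]
node_3 = [25, 25, 25, 25, 25, 25, 22]

input_variance  = sse(root)                                # ~25.2
output_variance = sse(node_1) + sse(node_2) + sse(node_3)  # ~0 + 6.5455 + 7.7143

print(round(1 - output_variance / input_variance, 5))      # ~0.43414 -> Variance Explained
```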
Categorical DVs – Entropy Explained
With a categorical DV we have no Averages and Deviations, so the result relies entirely on the split of the DV categories in the nodes. Remember that the aim of the Decision Tree is to maximise the differences between categories.
Our new formulae:
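In text form (my reconstruction, using the OutputIndex/InputIndex names that appear later in this post):

Entropy Explained = 1 - (OutputIndex / InputIndex)

where InputIndex is the Entropy of the parent node and OutputIndex is the weighted sum of the Entropy across the child nodes (the OutputIndex formula is further down).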
So again we're looking at the ratio of the (weighted) sum of the measure across the output/child nodes to the measure for the input/parent node, subtracted from 1.
With the formula for Entropy (G above) being:
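In text form for a binary DV (my reconstruction from the worked numbers later in this post, rather than the exact Help notation):

G = -( p*log2(p) + q*log2(q) )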
The effect of this for a binary DV is shown in this diagram (from https://www.saedsayad.com/decision_tree.htm)
Where p is the proportion of one category of the DV and q is the proportion of the other, each expressed as a fraction of the total – we'll show this in the example below.
The diagram shows that Entropy trends to 0 for more 'pure' nodes (where only one category exists) and trends towards 1 as the distribution of the DV categories becomes more equal. The Decision Tree is trying to get close to 0 for the leaf nodes.
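If you want to see those numbers without squinting at the curve, here's a minimal binary entropy sketch in Python (pure nodes give 0, a 50/50 split gives 1):

```python
import math

def entropy(p):
    """Binary entropy for a node where one category has proportion p."""
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0           # a 'pure' node
    return -(p * math.log2(p) + q * math.log2(q))

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, round(entropy(p), 5))
# 0.5 gives 1.0 (the top of the curve); 0.0 and 1.0 give 0.0 (pure nodes)
```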
Let's break this all down with a walkthrough.
Again we're going to use a single-split tree to show the calculations. In this case we're trying to predict the probability of 'Sex' with a split on 'Occupation'. You can see in the diagram below that we have an even split of genders – 10 of each, giving a total of 20 records. The 'Split Report' for Occupation also gives an Input Entropy of 1 and a Ratio Entropy of 0.13613.
For those interested, Resubstitution gives:
Accuracy is calculated from the misclassification rate (the category with probability > 0.5 becomes the classification for the node, and records in the other category are counted as misclassified). E.g. the node on the right has 3 Females and 7 Males, with a probability of 0.7 for Male, so the 3 Females are misclassified. Across the tree we have 6 misclassified out of 20 = 6/20 = 0.3. Accuracy is the proportion we got right, i.e. 1 - 0.3 = 0.7.
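Here's a quick sketch of that accuracy arithmetic; note the counts for the first two nodes are my inference from the probabilities shown on the tree (5 records in each):

```python
# (Female, Male) counts per leaf node
leaf_counts = [(3, 2),   # node 1: p = 0.6, q = 0.4 (counts inferred from 5 records)
               (4, 1),   # node 2: p = 0.8, q = 0.2 (counts inferred from 5 records)
               (3, 7)]   # node 3: p = 0.3, q = 0.7 (the node described above)

total         = sum(f + m for f, m in leaf_counts)      # 20
misclassified = sum(min(f, m) for f, m in leaf_counts)  # 2 + 1 + 3 = 6

print(misclassified / total)       # 0.3 -> misclassification rate
print(1 - misclassified / total)   # 0.7 -> accuracy
```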
But we're here to calculate Entropy, not Accuracy.
Let's make this easier to see by using the Strategy Tree again.
Here's the formula for Entropy from above:
Which gives the following on the tree:
The Input Entropy (the Entropy of the root node) is given as 1. This is because we have a perfect split of Female and Male records in the node.
Remember the formula above for Entropy where p and q are the fractions of the two categories in the DV?
If we substitute p = 0.5 (10 Female / 20 total records) and q = 0.5 (10 Male / 20 total records) into the formula above, we hit the top of that curve with Entropy = 1.
We can then do the same for each of the child nodes (the fraction is the same as the percentage shown on the tree; there's a quick check in code after the list):
Node 1: p=0.6, q=0.4, gives Entropy = 0.97095
Node 2: p=0.8, q=0.2, gives Entropy = 0.72193
Node 3: p=0.3, q=0.7, gives Entropy = 0.88129
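Here's that quick check in code, using the same binary entropy formula:

```python
import math

def entropy(p):
    """Binary entropy for a node where one category has proportion p."""
    q = 1 - p
    return -(p * math.log2(p) + q * math.log2(q))

for label, p in [("root", 0.5), ("node 1", 0.6), ("node 2", 0.8), ("node 3", 0.3)]:
    print(label, round(entropy(p), 5))
# root 1.0, node 1 0.97095, node 2 0.72193, node 3 0.88129
```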
Now for the last step. Because we've split the tree into 3 nodes with a different number of records in each, we can't take a straight average of the node entropies; we need to weight each one by the population of its node (the probability of a record falling in that node).
This is where the OutputIndex formula is used:
With n being the number of records in each child node (summed over nodes 1 to i - fancy talk for all the child nodes), N being the number of records in the parent node, and G being the Entropy we calculated before.
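In text form (again my reconstruction rather than the exact Help notation):

OutputIndex = Σ ( n_i / N ) * G_i,  summed over the child nodes i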
Putting this into a single diagram:
So the final equation becomes:
OutputIndex = 0.25*0.97095 + 0.25*0.72193 + 0.5*0.88129 = 0.863865
Looks about right?
And Finally…
Entropy Explained = 1 - OutputIndex / InputIndex = 1 - 0.86387/1 = 0.13613
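And the same final step as a quick sketch in code, using the node entropies from above and the node populations (5, 5 and 10 records out of 20):

```python
node_entropy = [0.97095, 0.72193, 0.88129]   # child node entropies from the step above
node_weight  = [5/20, 5/20, 10/20]           # n_i / N for each child node

output_index = sum(w * g for w, g in zip(node_weight, node_entropy))
input_index  = 1.0                           # entropy of the 50/50 root node

print(output_index)                          # ~0.863865
print(1 - output_index / input_index)        # ~0.1361 -> the 0.13613 Entropy Explained
```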
Congratulations, you can now manually calculate your own Decision Trees!!
------------------------------
Alex Gobolos
Sales Engineer
Altair
Toronto ON
------------------------------