Why are the distributions in trees of Random Forests incorrect?

Question

I'm using a Random Forest to discover rules based on a simple dataset. After computing the model, I check the trees to find leaves with high confidence. When I compare the number of records shown in the tree description with the data in the dataset, the numbers do not match. For instance, I have one attribute with a 50/50 distribution (greater than 0 and less than 0). The tree has the correct split value (0) but shows 10 more records in the left branch.
Any ideas?

MartinLiebig · Accepted Answer

Hi, can you provide an example of this?

Keep in mind that a Random Forest works on a bootstrapped sample of the original data set; this may explain the deviations.

Best, Martin

MartinLiebig · Accepted Answer

Hi @Friedemann, of course. There are usually two factors that make a Random Forest random.
First, each node only 'sees' a subset of all attributes and then takes the best split among them.
Second, each tree is trained not on the full data set but on a sample of 90% of the original data set. This 90% is not a simple random sample; it is a bootstrapped sample, drawn with replacement. This means an example can be picked twice or even three times.
Have a look at the following process, which generates a forest that consists only of "root nodes". You can see that each root node has a different distribution of yes/no. This is because of the bootstrapping.
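The effect can also be sketched outside RapidMiner. The snippet below is only a rough illustration in plain Python, not the operator's actual implementation: the function name `bootstrap_counts`, the 100-example data set, and the seeds are made up for the example, and the 0.9 fraction simply mirrors the 90% mentioned above. It draws a sample with replacement from a perfectly balanced yes/no label column and counts what each "tree" would see at its root.

```python
import random
from collections import Counter

def bootstrap_counts(labels, fraction=0.9, seed=0):
    """Draw a bootstrapped sample (with replacement) and count the labels.

    Hypothetical helper for illustration; `fraction` is the assumed
    sample size relative to the full data set.
    """
    rng = random.Random(seed)
    k = round(len(labels) * fraction)
    # random.choices samples WITH replacement, so the same example
    # can be drawn more than once.
    sample = rng.choices(labels, k=k)
    return Counter(sample)

# A perfectly balanced label column: 50 "yes" and 50 "no".
labels = ["yes"] * 50 + ["no"] * 50

# Each seed stands in for one tree of the forest.
for seed in range(5):
    counts = bootstrap_counts(labels, seed=seed)
    print(f"tree {seed}: yes={counts['yes']} no={counts['no']}")
```

Each line of output always sums to 90, but the yes/no split typically drifts away from an exact 50/50 and differs from tree to tree, which is the kind of deviation described in the question.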

Best, Martin