Issue found in feature weight of RandomForest for regression

marcin_blachnik
marcin_blachnik New Altair Community Member
edited November 5 in Community Q&A
There seems to be a bug in the feature_weights returned by the RandomForest operator, but only for regression. I first noticed the problem on one dataset and then reproduced it on the Iris dataset, where attributes a3 and a4 are the most important, yet the regression RandomForest ranks these two as the least important.
I also evaluated other implementations of RandomForest for regression, and they return the expected weights.

Best regards
Marcin

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member
    I had submitted a bug quite some time ago regarding the RandomForest weights. It looks like it may still be uncorrected, and this is another example of the same underlying issue.
  • marcin_blachnik
    marcin_blachnik New Altair Community Member
    Hi

    I'm surprised that such reports are ignored. Many people use RandomForest weights as a feature importance indicator and make serious decisions based on them.
    It would also be nice if someone from RM answered "thank you, we will analyze the reported issue", but there has been no response.

    Below I attach another process in which a pure-noise attribute is the second most important variable according to the RapidMiner implementation of RandomForest (the most important one also seems to be selected by chance). Because the forest is simple (5 trees of depth 5), one can count how many times each attribute appears as a decision node. By that count, the noise variable is the least important.
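
The manual check described above (counting how often each attribute appears as a decision node) can be sketched in a few lines of Python. This is only an illustration: the nested-dict tree layout, the helper names `leaf`, `split` and `count_split_attributes`, and the two toy trees are all assumptions made up for the example. They are not RapidMiner's model format and not the trees from the attached process.

```python
from collections import Counter


def leaf(value):
    # Hypothetical leaf node holding a predicted value.
    return {"value": value}


def split(attribute, left, right):
    # Hypothetical decision node that splits on one attribute.
    return {"attribute": attribute, "left": left, "right": right}


def count_split_attributes(forest):
    """Count how many decision nodes use each attribute across all trees."""
    counts = Counter()
    stack = list(forest)
    while stack:
        node = stack.pop()
        if "attribute" in node:  # decision node; leaves are skipped
            counts[node["attribute"]] += 1
            stack.append(node["left"])
            stack.append(node["right"])
    return counts


# Two invented toy trees; "a5" stands in for the noise attribute.
toy_forest = [
    split("a3", leaf(0.2), split("a4", leaf(1.3), leaf(2.1))),
    split("a4", split("a3", leaf(0.2), leaf(1.4)), split("a5", leaf(1.8), leaf(2.2))),
]

split_counts = count_split_attributes(toy_forest)
for attribute, n in split_counts.most_common():
    print(attribute, n)
```

A split count is only a rough sanity check, since it ignores how much each split improves the criterion, but a large disagreement between the counts and the reported weights is a reasonable signal that something is off.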

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,
    I have the odd feeling that the weight generation does not take the number of examples into account, but just sums the gain at each node. Would this explain the behaviour?

    ~Martin
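
To make this hypothesis concrete, here is a minimal Python sketch comparing the two ways of aggregating per-node gain. It is not RapidMiner's code, and the node list is invented for illustration: each tuple is (attribute, number of examples reaching the node, gain at the node), and the names `NODES`, `N_TOTAL` and `importance` are assumptions. The point is only that a noise attribute used deep in a tree, where few examples remain, can dominate an unweighted sum while contributing almost nothing once each gain is scaled by the fraction of examples at the node.

```python
from collections import defaultdict

# Invented split nodes: (attribute, examples reaching the node, gain at the node).
NODES = [
    ("a3", 150, 0.40),
    ("a4", 90, 0.25),
    ("a4", 60, 0.20),
    ("noise", 4, 0.45),
    ("noise", 3, 0.50),
]
N_TOTAL = 150  # examples at the root


def importance(nodes, n_total, weighted):
    """Sum the gain per attribute, optionally scaled by the share of examples."""
    totals = defaultdict(float)
    for attribute, n_examples, gain in nodes:
        factor = n_examples / n_total if weighted else 1.0
        totals[attribute] += gain * factor
    norm = sum(totals.values())
    return {attribute: value / norm for attribute, value in totals.items()}


unweighted = importance(NODES, N_TOTAL, weighted=False)
weighted = importance(NODES, N_TOTAL, weighted=True)

for name, scores in (("unweighted", unweighted), ("weighted", weighted)):
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(name, ranking)
```

With these made-up numbers the noise attribute ranks first in the unweighted sum and last in the weighted one, which is the kind of reversal the reported weights would show if this hypothesis were correct.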
  • marcin_blachnik
    marcin_blachnik New Altair Community Member
    Hi

    I haven't checked the source code, but I have a feeling that the problem is deeper. In the example from my previous post, where the Random Forest consists of 5 trees, the noise attribute A5 appears only twice in the trees, while A3 and A4 appear most often. For classification the weights work correctly, so I think the problem may be related to the regression criterion and its properties.
    Nevertheless, it would be great if RM corrected it in the upcoming release.

    Best regards
  • gmeier
    gmeier New Altair Community Member
    edited January 2021
    Thank you for the bug report. We found the problem and fixed it. It will be part of the next release.