Improve Random forest performance

a_polito
a_polito New Altair Community Member
edited November 5 in Community Q&A
Hello! :) I'm working on a random forest model to predict a binary label. The dataset is imbalanced, with a class split of about 70%/30%. The attributes are numeric and represent financial statement indices or amounts in euros, such as EBITDA.

The process includes reading the data, selecting features with fewer than 10% missing values, normalization (Z-transformation), replacing missing values with the average, cross-validation with undersampling of the majority class in the training data, and a random forest with information gain (200 trees of depth 15).

The performance is not good: accuracy is about 74%, weighted recall 75%, weighted precision 72%, and F-measure 65.89 (class precision for the primary class is 57%).

How can I improve performance? Do you have any suggestions?

Best Answer

  • rfuentealba
    rfuentealba New Altair Community Member
    Answer ✓
    Hello, and hopefully it's not too late to answer:

    It is difficult to answer without knowing the data, and there are several possible strategies. Can you apply some kind of discretization? Converting continuous values into discrete ones, or "buckets", might help. Do you know whether any anomaly or trend might be hidden in the data? Those are the ideas I can come up with here.
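
    To make the discretization idea concrete, here is a minimal Python sketch of equal-frequency (quantile) binning using only the standard library. The EBITDA values, the helper names, and the choice of four bins are all assumptions for illustration; nothing here comes from your actual data or process.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin_edges(values, n_bins=4):
    """Return the n_bins - 1 interior cut points for equal-frequency bins."""
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bin(value, edges):
    """Map a value to a bin index between 0 and len(edges)."""
    return bisect_right(edges, value)


# Hypothetical EBITDA amounts in euros (illustrative, not from the thread).
ebitda = [-50_000, 12_000, 30_000, 45_000, 80_000, 150_000, 400_000, 2_000_000]

edges = quantile_bin_edges(ebitda, n_bins=4)
binned = [assign_bin(v, edges) for v in ebitda]

print(edges)   # cut points learned from the data
print(binned)  # bin index per company
```

    Inside cross-validation, the cut points should be computed on the training fold only and then reused on the test fold, so that no information leaks from the test data.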

    Also, undersampling can sometimes introduce issues, since it discards majority-class examples and changes the class distribution the model sees during training. Weighting might be better, if your algorithm supports it.
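
    As a rough sketch of what weighting means here, the snippet below computes "balanced" class weights, where each class gets n_samples / (n_classes * class_count). This is the same heuristic scikit-learn applies when a classifier is given class_weight="balanced"; whether your tool exposes something equivalent is an assumption you would need to check. The 70/30 label list is made up to mirror the split you described.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels mirroring a 70%/30% split (0 = majority, 1 = minority).
labels = [0] * 70 + [1] * 30

weights = balanced_class_weights(labels)
# One weight per example, for learners that accept per-example weights.
sample_weights = [weights[y] for y in labels]

print(weights)
```

    With these weights both classes contribute the same total weight to training, and no majority-class rows have to be thrown away.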
