Heritage Health: problems creating a useable Random Forest model

joei005 New Altair Community Member
edited November 5 in Community Q&A
Hello,

I am having problems developing a useable Random Forest model in the RapidMiner GUI.

The dataset is from the Heritage Healthcare contest. It has approximately 144 attributes and over 70k examples.

The datatypes are mostly numeric and binomial. The label is numeric.

I am new to the RapidMiner GUI and am trying to create a simple Random Forest model.

The process is straightforward. It reads in a .csv file, sets the roles, discretizes the numeric label into 10 bins, splits the process into modeling and validation steps, and writes out the model.

When I initially ran the process, every tree contained a single node whose predicted value was the range negative infinity to 0.278.

When I turned off pruning and pre-pruning, the process failed with an error message of "cannot clone example set".

When I turned off pre-pruning BUT turned on pruning, the process didn't fail but didn't produce better results either. When I switched the algorithm type to gini_variance, the model produced trees with multiple nodes.

However, when I checked the performance of the model from the validation process, the model predicts only the range negative infinity to 0.287. The performance operator indicates that this gives an 84% performance.
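
One possible reading of this result, offered as an assumption rather than a diagnosis: if the numeric label is heavily skewed toward small values, equal-width binning puts most examples into the first bin, and a model that always predicts that bin scores as high as that bin's share of the data. The Python sketch below illustrates the effect on synthetic, made-up label values (not the Heritage data). The helper names `equal_width_bins` and `majority_baseline` are hypothetical and are not RapidMiner operators.

```python
import random
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def majority_baseline(bin_ids):
    """Return (most common bin, share of examples that fall into it)."""
    bin_id, count = Counter(bin_ids).most_common(1)[0]
    return bin_id, count / len(bin_ids)

# Synthetic, heavily right-skewed label. This is NOT the Heritage data.
rng = random.Random(0)
labels = [rng.expovariate(1.0) for _ in range(10_000)]

bins = equal_width_bins(labels, n_bins=10)
top_bin, share = majority_baseline(bins)
print(f"most common bin: {top_bin}, share of examples: {share:.1%}")
```

If the share printed for the real label is close to the reported performance, the model is probably doing no better than always predicting the most common bin. In that case, comparing against this baseline or using equal-frequency bins may be more informative than the raw performance figure.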

Do you know how to modify the model so that more ranges are used in the prediction?

I lowered the gain needed to create a new node to 0.05 and decreased the confidence level from 0.25 to 0.05.

Thanks!

Answers

  • MariusHelf New Altair Community Member
    The random forest has a lot of parameters that need to be optimized. As always in data mining, patience is your friend :) The Optimize Parameters or Loop Parameters operators, in combination with a Log operator, will greatly ease the job of finding good parameters. In addition, you may want to try a different implementation of the Random Forest, such as W-Random Forest from the Weka extension, and also try completely different algorithms such as SVM, or, as a quick shot, maybe Naive Bayes.

    Just experiment with the possibilities ;)

    Best, Marius
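
The idea behind looping over parameters and logging each run can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the approach, not RapidMiner's implementation. `grid_search` and `toy_evaluate` are hypothetical names, the parameter names and values in the grid are illustrative assumptions, and `toy_evaluate` is a made-up scoring function standing in for "train on the modeling split, score on the validation split".

```python
import itertools

def grid_search(param_grid, evaluate):
    """Try every parameter combination, log each score, return the best one."""
    names = sorted(param_grid)
    log = []
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        log.append((params, evaluate(params)))
    best_params, best_score = max(log, key=lambda entry: entry[1])
    return best_params, best_score, log

def toy_evaluate(params):
    # Made-up score, for illustration only. In practice this would train a
    # model with `params` and return its validation performance.
    return (-abs(params["number_of_trees"] - 50) / 100
            - abs(params["minimal_gain"] - 0.05))

# Illustrative grid; the names and values are assumptions, not recommendations.
grid = {
    "number_of_trees": [10, 50, 100],
    "minimal_gain": [0.01, 0.05, 0.1],
}

best_params, best_score, log = grid_search(grid, toy_evaluate)
for params, score in log:
    print(params, round(score, 3))
print("best:", best_params)
```

Keeping the full log, rather than only the winner, makes it easier to see which parameters actually move the score and which have little effect.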