Could anyone please explain how the RapidMiner implementation of the Random Forest operator handles missing values in attributes?
Hi,
In both Random Forest and Decision Tree, missing values are treated as a separate value, for numerical as well as nominal attributes. You can verify this yourself with the following process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process"> <process expanded="true"> <operator activated="true" class="retrieve" compatibility="9.0.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34"> <parameter key="repository_entry" value="//Samples/data/Titanic Training"/> </operator> <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value" width="90" x="246" y="34"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Sex"/> <parameter key="mode" value="nominal"/> <parameter key="nominal_value" value="Female"/> </operator> <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="447" y="34"> <parameter key="attribute_filter_type" value="single"/> <parameter key="attribute" value="Age"/> <parameter key="mode" value="expression"/> <parameter key="nominal_value" value="Female"/> <parameter key="expression_value" value="Age>40"/> </operator> <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.0.000" expanded="true" height="103" name="Random Forest" width="90" x="648" y="34"/> <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Declare Missing Value" to_port="example set input"/> <connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/> <connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="Random Forest" to_port="training set"/> <connect from_op="Random Forest" from_port="model" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> 
<portSpacing port="sink_result 2" spacing="0"/> </process> </operator></process>
Note that for numerical attributes this results in a 3-way split: the two sides of the numeric threshold plus a separate branch for missing values.
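To make the 3-way split more concrete, here is a minimal Python sketch of the idea. It is not RapidMiner's actual code; the function name `three_way_split`, the branch labels, and the threshold value are made up for illustration. It only shows how a numeric attribute can be partitioned into "at most the threshold", "above the threshold", and "missing".

```python
import math


def three_way_split(values, threshold):
    """Partition numeric values into three branches (illustrative only)."""
    branches = {"le": [], "gt": [], "missing": []}
    for v in values:
        # None and NaN are both treated as "missing" in this sketch
        if v is None or (isinstance(v, float) and math.isnan(v)):
            branches["missing"].append(v)
        elif v <= threshold:
            branches["le"].append(v)
        else:
            branches["gt"].append(v)
    return branches


# Hypothetical Age values; None and NaN stand in for declared missing values
ages = [22.0, 38.0, None, 35.0, float("nan"), 29.0]
result = three_way_split(ages, threshold=30.0)
```

Here the missing ages end up in their own branch instead of being dropped or forced to one side of the threshold, which is the behaviour described above.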
With Decision Tree models, imputing missing values doesn't improve the model unless you have a very precise way to do it.
Regards,
Sebastian