Could anyone please explain how the RapidMiner implementation of the Random Forest operator handles missing values in attributes?
Hi,
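In both Random Forest and Decision Tree, missing values are treated as a separate value, for numerical as well as nominal attributes. You can check this yourself with the following process, which declares Sex = Female and Age > 40 as missing in the Titanic Training sample data and then trains a Random Forest: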
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.0.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Sex"/>
        <parameter key="mode" value="nominal"/>
        <parameter key="nominal_value" value="Female"/>
      </operator>
      <operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Age"/>
        <parameter key="mode" value="expression"/>
        <parameter key="nominal_value" value="Female"/>
        <parameter key="expression_value" value="Age>40"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.0.000" expanded="true" height="103" name="Random Forest" width="90" x="648" y="34"/>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Declare Missing Value" to_port="example set input"/>
      <connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
      <connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="Random Forest" to_port="training set"/>
      <connect from_op="Random Forest" from_port="model" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
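Note that for numerical attributes this results in a 3-way split: besides the two branches on either side of the split value, the missing values get a branch of their own.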
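To make the 3-way split more concrete, here is a minimal Python sketch of how a single numerical split could route examples when missing values count as their own value. This is only an illustration of the idea, not RapidMiner's actual code: the function name `route_numeric`, the branch labels, and the threshold of 30 are all made up for the example.

```python
import math


def route_numeric(value, threshold):
    """Return the branch a numeric value falls into under a 3-way split.

    Hypothetical helper for illustration only; not part of RapidMiner.
    """
    # Missing values (None or NaN) go to a dedicated branch instead of
    # being imputed or dropped.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    # Non-missing values are split on the threshold as usual.
    return "<= threshold" if value <= threshold else "> threshold"


# Example ages; None and NaN stand in for values declared as missing.
ages = [22.0, 38.0, None, float("nan"), 30.0]
branches = [route_numeric(age, threshold=30.0) for age in ages]
print(branches)
```

A nominal attribute works the same way, except that "missing" is simply one more category next to the existing ones.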
With Decision Tree models, imputing missing values doesn't improve the model unless you have a very precise way to do it.
Regards,
Sebastian