Random Forest
Liverpool_Reds
New Altair Community Member
Could anyone please explain how the RapidMiner implementation of the Random Forest operator handles missing values in attributes?
Answers
Hi,
In both Random Forest and Decision Tree, missing values are treated as a separate value of the attribute, for numerical as well as nominal attributes. You can check this yourself with the following process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.0.000" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Sex"/>
<parameter key="mode" value="nominal"/>
<parameter key="nominal_value" value="Female"/>
</operator>
<operator activated="true" class="declare_missing_value" compatibility="9.0.000" expanded="true" height="82" name="Declare Missing Value (2)" width="90" x="447" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Age"/>
<parameter key="mode" value="expression"/>
<parameter key="nominal_value" value="Female"/>
<parameter key="expression_value" value="Age>40"/>
</operator>
<operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.0.000" expanded="true" height="103" name="Random Forest" width="90" x="648" y="34"/>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Declare Missing Value" to_port="example set input"/>
<connect from_op="Declare Missing Value" from_port="example set output" to_op="Declare Missing Value (2)" to_port="example set input"/>
<connect from_op="Declare Missing Value (2)" from_port="example set output" to_op="Random Forest" to_port="training set"/>
<connect from_op="Random Forest" from_port="model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Note that for numerical attributes this results in a 3-way split: the two sides of the split threshold plus a separate branch for missing values.
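To make the 3-way split more concrete, here is a minimal Python sketch of how one numerical split node could route examples when "missing" is its own branch. This is only an illustration of the idea, not RapidMiner's actual code: the function name `route`, the branch labels, and the threshold of 40 are all made up for the example.

```python
import math

def route(value, threshold):
    """Return the branch a numerical value falls into at one split node.

    Hypothetical illustration: missing values (None or NaN) get a branch
    of their own instead of being forced to one side of the threshold.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    return "<=" if value <= threshold else ">"

# Made-up ages; None and NaN both stand for a missing value.
ages = [22.0, 38.0, None, 54.0, float("nan"), 40.0]
branches = [route(age, 40.0) for age in ages]
print(branches)
```

Each example ends up in exactly one of the three branches, so the missing ones are kept together rather than dropped or guessed.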
With Decision Tree models, imputing missing values doesn't improve the model unless you have a very precise way to do it.
Regards,
Sebastian