Decision tree: Different results with/without "Nominal to Numerical" operator
Hi,
I'm experimenting with decision trees in RapidMiner and I found some strange results:
When the "Nominal to Numerical" operator (using "dummy coding", for example) is inserted after the training and test datasets, the resulting decision tree is not the same as without it, and the prediction confidences on the test dataset differ for some examples. (The differences are very small: the final prediction is the same in both cases.)
How can this behaviour be explained?
Does the same thing happen with other classification algorithms?
Which approach should be preferred?
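To make the question concrete, here is a minimal sketch in plain Python of one possible (unconfirmed) explanation. It assumes that a tree splits a nominal attribute into one branch per value, while each dummy-coded column can only give a binary "is this value / is not this value" split. The rows are made up for illustration (they are not the real Deals data), and the helper names (`multiway_split`, `dummy_split`, `yes_confidence`) are hypothetical. This is not RapidMiner's implementation.

```python
from collections import Counter

# Made-up toy rows (payment method, future customer); NOT the real Deals data.
rows = [
    ("cash", "yes"), ("cash", "yes"), ("cash", "yes"),
    ("cheque", "no"), ("cheque", "no"),
    ("credit card", "yes"), ("credit card", "no"), ("credit card", "no"),
]
values = sorted({value for value, _ in rows})


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(leaves):
    """Size-weighted Gini impurity over the leaves produced by one split."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves)


def multiway_split(rows):
    """Assumed nominal behaviour: one leaf per attribute value."""
    return {val: [y for v, y in rows if v == val] for val in values}


def dummy_split(rows, value):
    """Assumed dummy-coded behaviour: binary split on one indicator column."""
    return {
        True: [y for v, y in rows if v == value],
        False: [y for v, y in rows if v != value],
    }


def yes_confidence(leaf):
    """Share of 'yes' labels in a leaf, i.e. the confidence for 'yes'."""
    return leaf.count("yes") / len(leaf)


# Score the single multiway split and every candidate dummy split.
multiway = multiway_split(rows)
multiway_score = weighted_gini(list(multiway.values()))
dummy_scores = {
    val: weighted_gini(list(dummy_split(rows, val).values())) for val in values
}
best_dummy = min(dummy_scores, key=dummy_scores.get)

# Confidence for a "credit card" example after one split in each setting.
conf_multiway = yes_confidence(multiway["credit card"])
conf_dummy = yes_confidence(dummy_split(rows, best_dummy)["credit card" == best_dummy])

print("multiway split impurity:", round(multiway_score, 4))
print("best dummy split:", best_dummy, round(dummy_scores[best_dummy], 4))
print("confidence(yes) for a credit card row:", conf_multiway, "vs", conf_dummy)
```

On this toy data, a "credit card" example ends up in different leaves in the two settings, so its confidence for "yes" is 1/3 after the multiway split and 1/5 after the best dummy split (cash vs. not cash), while the predicted class is "no" in both cases. A real tree would keep splitting, subject to its depth and pruning settings, so this only illustrates how the two encodings could lead to different trees with the same final predictions.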
Here is the process with the "Nominal to Numerical" operator:
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="build model" width="90" x="581" y="85">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier def rm_main(data): base =data.iloc[:,0 :-1] clf = DecisionTreeClassifier(criterion ='entropy', splitter = 'best',max_depth=20) clf.fit(base, data['Future Customer']) return clf,base"/>
</operator>
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="apply model" width="90" x="581" y="238">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier def rm_main(model, data): #base =data[['Age', 'Gender', 'Payment Method']] base =data.iloc[:,0:-1] pred = model.predict(base) proba = model.predict_proba(base) score = model.score(base,pred) data['prediction'] = pred data['probabilité_1'] = proba[:,0] data['probabilité_2'] = proba[:,1] data['score'] = score #set role of prediction attribute to prediction data.rm_metadata['prediction']=(None,'prediction') data.rm_metadata['probabilité_1']=(None,'probabilité_1') data.rm_metadata['probabilité_2']=(None,'probabilité_2') return data,base"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="136">
<parameter key="criterion" value="gini_index"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals-Testset" width="90" x="45" y="289">
<parameter key="repository_entry" value="//Samples/data/Deals-Testset"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="340">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals-Testset" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="orange" colored="true" height="322" resized="true" width="227" x="170" y="10">convert nominal values to unique integers (needed this way in Python operators)</description>
<description align="center" color="green" colored="true" height="317" resized="true" width="270" x="427" y="14">build model using sklearn module in Python and apply the model to test data</description>
</process>
</operator>
</process>
Thank you for your response,
Regards,
Lionel