Decision tree: different results with/without the "Nominal to Numerical" operator

User: "lionelderkrikor"
New Altair Community Member
Updated by Jocelyn

Hi,

 

I'm running some experiments with RapidMiner on decision trees and I found some strange results:

When the "Nominal to Numerical" operator (using "dummy coding", for example) is inserted after the training and test datasets, the resulting decision tree is not the same as when the operator is omitted, and the prediction confidences on the test dataset differ for some examples. (The differences are very small: the final prediction is the same in both cases.)

But how can this behaviour be explained?

What about other classification algorithms?

Which method should be preferred?
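
To make the first question concrete, here is a minimal Python sketch of one possible cause. It assumes that the tree learner can split a nominal attribute into one branch per value, whereas a dummy-coded attribute only offers binary 0/1 splits. That assumption, the toy data and the helper names (`gini`, `weighted_gini`) are mine for illustration only; this is not RapidMiner's implementation and not the Deals sample data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Size-weighted Gini impurity over the child nodes of a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)


# Hypothetical toy data (NOT the Deals sample set): (payment method, label)
rows = [
    ("cash", "yes"), ("cash", "yes"), ("cash", "no"),
    ("credit card", "no"), ("credit card", "no"),
    ("cheque", "yes"), ("cheque", "no"),
]

values = sorted({value for value, _ in rows})

# Nominal attribute: a single multiway split, one child node per value.
multiway = weighted_gini(
    [[label for value, label in rows if value == v] for v in values]
)

# Dummy-coded attribute: each 0/1 column only allows "value vs. the rest".
binary = {
    v: weighted_gini([
        [label for value, label in rows if value == v],
        [label for value, label in rows if value != v],
    ])
    for v in values
}

print("multiway split impurity:", round(multiway, 4))
for v, impurity in binary.items():
    print("binary split on", repr(v), "impurity:", round(impurity, 4))
```

On this toy data the multiway split and the best binary split do not reach the same impurity, so the two representations can lead to differently shaped trees, and therefore to slightly different leaf distributions and confidences, even when the final prediction is the same.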

 

Here is the process with the "Nominal to Numerical" operator:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="build model" width="90" x="581" y="85">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(data):&#10; &#10; base =data.iloc[:,0 :-1]&#10; clf = DecisionTreeClassifier(criterion ='entropy', splitter = 'best',max_depth=20)&#10; clf.fit(base, data['Future Customer'])&#10; return clf,base"/>
</operator>
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="apply model" width="90" x="581" y="238">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(model, data):&#10; #base =data[['Age', 'Gender', 'Payment Method']]&#10; base =data.iloc[:,0:-1]&#10; pred = model.predict(base)&#10; proba = model.predict_proba(base)&#10; score = model.score(base,pred)&#10; data['prediction'] = pred&#10; data['probabilité_1'] = proba[:,0]&#10; data['probabilité_2'] = proba[:,1]&#10; data['score'] = score&#10;&#10; #set role of prediction attribute to prediction&#10; data.rm_metadata['prediction']=(None,'prediction')&#10; data.rm_metadata['probabilité_1']=(None,'probabilité_1')&#10; data.rm_metadata['probabilité_2']=(None,'probabilité_2')&#10; return data,base"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="136">
<parameter key="criterion" value="gini_index"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals-Testset" width="90" x="45" y="289">
<parameter key="repository_entry" value="//Samples/data/Deals-Testset"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="340">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals-Testset" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="orange" colored="true" height="322" resized="true" width="227" x="170" y="10">convert nominal values to unique integers (needed this way in Python operators)</description>
<description align="center" color="green" colored="true" height="317" resized="true" width="270" x="427" y="14">build model using sklearn module in Python and apply the model to test data</description>
</process>
</operator>
</process>

 

Thank you for your response,

 

Regards,

 

Lionel