
Decision tree: Different results with and without the "Nominal to Numerical" operator

User: "lionelderkrikor"
New Altair Community Member
Updated by Jocelyn

Hi,

 

I'm doing some experiments with RapidMiner on decision trees and I have found strange results:

Depending on whether the "Nominal to Numerical" operator (using "dummy coding", for example) is applied to the training and test datasets or not, the resulting decision tree is not the same, and the prediction confidences on the test dataset therefore differ for some examples. (These are very small differences: the final prediction is the same in both cases.)

But how can we explain this behaviour?

What about other classification algorithms?

Which method should be preferred?
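For readers unfamiliar with the term: "dummy coding" replaces each nominal value with a 0/1 indicator column. A minimal sketch of the same transformation in pandas (the column name is illustrative, not the actual Deals schema):

```python
import pandas as pd

# A toy nominal attribute, similar in spirit to those in the Deals sample data
df = pd.DataFrame({"Payment Method": ["cash", "credit card", "cash"]})

# Dummy coding: one 0/1 indicator column per nominal value
dummies = pd.get_dummies(df, columns=["Payment Method"])
print(dummies.columns.tolist())
# ['Payment Method_cash', 'Payment Method_credit card']
```

This is roughly what "Nominal to Numerical" with coding type "dummy coding" does inside RapidMiner.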

 

Here is the process with the "Nominal to Numerical" operator:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="build model" width="90" x="581" y="85">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(data):&#10; &#10; base =data.iloc[:,0 :-1]&#10; clf = DecisionTreeClassifier(criterion ='entropy', splitter = 'best',max_depth=20)&#10; clf.fit(base, data['Future Customer'])&#10; return clf,base"/>
</operator>
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="apply model" width="90" x="581" y="238">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(model, data):&#10; #base =data[['Age', 'Gender', 'Payment Method']]&#10; base =data.iloc[:,0:-1]&#10; pred = model.predict(base)&#10; proba = model.predict_proba(base)&#10; score = model.score(base,pred)&#10; data['prediction'] = pred&#10; data['probabilité_1'] = proba[:,0]&#10; data['probabilité_2'] = proba[:,1]&#10; data['score'] = score&#10;&#10; #set role of prediction attribute to prediction&#10; data.rm_metadata['prediction']=(None,'prediction')&#10; data.rm_metadata['probabilité_1']=(None,'probabilité_1')&#10; data.rm_metadata['probabilité_2']=(None,'probabilité_2')&#10; return data,base"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="136">
<parameter key="criterion" value="gini_index"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals-Testset" width="90" x="45" y="289">
<parameter key="repository_entry" value="//Samples/data/Deals-Testset"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="340">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals-Testset" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="orange" colored="true" height="322" resized="true" width="227" x="170" y="10">convert nominal values to unique integers (needed this way in Python operators)</description>
<description align="center" color="green" colored="true" height="317" resized="true" width="270" x="427" y="14">build model using sklearn module in Python and apply the model to test data</description>
</process>
</operator>
</process>
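Outside RapidMiner, the two deactivated Python operators in the process above amount to roughly the following sketch (the Deals sample set is not available here, so a small synthetic frame with made-up columns stands in; also note that the original script's `model.score(base, pred)` scores the model against its own predictions, so the true labels are used instead):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dummy-coded Deals data (columns are illustrative)
train = pd.DataFrame({
    "Age": [25, 40, 35, 50, 23, 61],
    "Gender = male": [1, 0, 1, 0, 1, 0],
    "Payment Method = cash": [1, 1, 0, 0, 1, 0],
    "Future Customer": ["yes", "no", "yes", "no", "yes", "no"],
})

# "build model" operator: fit a tree on all columns except the label
base = train.iloc[:, :-1]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=20, random_state=0)
clf.fit(base, train["Future Customer"])

# "apply model" operator: add prediction and confidence columns
train["prediction"] = clf.predict(base)
proba = clf.predict_proba(base)
train["confidence(no)"] = proba[:, 0]   # classes_ is sorted, so index 0 is "no"
train["confidence(yes)"] = proba[:, 1]

# Score against the true labels (the original script compared pred to itself)
accuracy = clf.score(base, train["Future Customer"])
```

On this tiny, conflict-free training set the tree separates the classes perfectly, so the training accuracy is 1.0; on the real Deals/Deals-Testset data the numbers will of course differ.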

 

Thank you for your response,

 

Regards,

 

Lionel

 

 

    User: "Telcontar120"
    New Altair Community Member
    Accepted Answer

    That is indeed a lot of questions!  I may not answer them all in complete detail, but here are some thoughts.

    1. I would say that "standard practice" is not to use "Nominal to Numerical" when dealing with nominal predictors in trees, since that conversion is not necessary for the algorithm to work properly, and it can skew the results (as you have seen) by modifying the information gain of the attributes in their raw form. Regardless of your approach to the predictors, I would strongly recommend cross-validation as best practice for model construction, as well as optimization of the key decision tree parameters. I see your initial process below does not use cross-validation or optimization.
    2. It is of course up to you if you prefer a version of the tree built on dummy variables for other reasons, but in my experience most analysts leave the nominal variables alone for trees. If you do create them, the easiest way is to do it in RapidMiner using the method you have shown. Full disclosure: I did not take the time to examine the specific differences between the two methods you provide below or to dive into the Python code. In theory, those should be the same.
    3. In the case of ties in confidence, RapidMiner will assign the prediction to the first label (so for nominals, it will depend on the alphabetical order of the class values).
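Point 1 can be made concrete with a small information-gain calculation (a sketch with made-up data, using the entropy criterion): a raw nominal attribute is evaluated as one multi-way split over all its values, while after dummy coding each indicator column is evaluated as a separate binary split, so the set of candidate splits and their gains changes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(feature, labels):
    """Information gain of splitting `labels` on each value of `feature`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

x = np.array(["a", "a", "b", "b", "c", "c"])          # raw nominal attribute
y = np.array(["yes", "yes", "yes", "no", "no", "no"])  # label

raw_gain = info_gain(x, y)                             # one multi-way split
dummy_gains = {v: info_gain(x == v, y) for v in "abc"} # one binary split per dummy
```

Here the raw attribute has gain 2/3, while no single dummy indicator reaches that value (and the "x = b" dummy has gain 0), so a learner that only sees the dummies ranks and chooses splits differently — consistent with the small differences observed in the question.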