
Decision tree: Different results with and without the "Nominal to Numerical" operator

User: "lionelderkrikor"
New Altair Community Member
Updated by Jocelyn

Hi,

 

I'm doing some experiments with RapidMiner on decision trees and I have found strange results:

Depending on whether the "Nominal to Numerical" operator (using "dummy coding", for example) is applied to the training and test datasets or not, the resulting decision tree is not the same, and the prediction confidences on the test dataset therefore differ for some examples. (These are very small differences: the final prediction is the same in both cases.)

But how can we explain this behaviour?

What about other classification algorithms?

Which method should be preferred?
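For readers unfamiliar with the term: "dummy coding" replaces each nominal value with a 0/1 indicator column. A minimal sketch of the same transformation in pandas (the column name is illustrative, not the actual Deals schema):

```python
import pandas as pd

# A toy nominal attribute, similar in spirit to those in the Deals sample data
df = pd.DataFrame({"Payment Method": ["cash", "credit card", "cash"]})

# Dummy coding: one 0/1 indicator column per nominal value
dummies = pd.get_dummies(df, columns=["Payment Method"])
print(dummies.columns.tolist())
# ['Payment Method_cash', 'Payment Method_credit card']
```

This is roughly what "Nominal to Numerical" with coding type "dummy coding" does inside RapidMiner.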

 

Here is the process with the "Nominal to Numerical" operator:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="build model" width="90" x="581" y="85">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(data):&#10; &#10; base =data.iloc[:,0 :-1]&#10; clf = DecisionTreeClassifier(criterion ='entropy', splitter = 'best',max_depth=20)&#10; clf.fit(base, data['Future Customer'])&#10; return clf,base"/>
</operator>
<operator activated="false" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="68" name="apply model" width="90" x="581" y="238">
<parameter key="script" value="from sklearn.tree import DecisionTreeClassifier&#10;def rm_main(model, data):&#10; #base =data[['Age', 'Gender', 'Payment Method']]&#10; base =data.iloc[:,0:-1]&#10; pred = model.predict(base)&#10; proba = model.predict_proba(base)&#10; score = model.score(base,pred)&#10; data['prediction'] = pred&#10; data['probabilité_1'] = proba[:,0]&#10; data['probabilité_2'] = proba[:,1]&#10; data['score'] = score&#10;&#10; #set role of prediction attribute to prediction&#10; data.rm_metadata['prediction']=(None,'prediction')&#10; data.rm_metadata['probabilité_1']=(None,'probabilité_1')&#10; data.rm_metadata['probabilité_2']=(None,'probabilité_2')&#10; return data,base"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals" width="90" x="45" y="85">
<parameter key="repository_entry" value="//Samples/data/Deals"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="7.6.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Future Customer"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="447" y="136">
<parameter key="criterion" value="gini_index"/>
</operator>
<operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Deals-Testset" width="90" x="45" y="289">
<parameter key="repository_entry" value="//Samples/data/Deals-Testset"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="246" y="340">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model" width="90" x="447" y="340">
<list key="application_parameters"/>
</operator>
<connect from_op="Retrieve Deals" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Retrieve Deals-Testset" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="orange" colored="true" height="322" resized="true" width="227" x="170" y="10">convert nominal values to unique integers (needed this way in Python operators)</description>
<description align="center" color="green" colored="true" height="317" resized="true" width="270" x="427" y="14">build model using sklearn module in Python and apply the model to test data</description>
</process>
</operator>
</process>
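Outside RapidMiner, the two deactivated Python operators in the process above amount to roughly the following sketch (the Deals sample set is not available here, so a small synthetic frame with made-up columns stands in; also note that the original script's `model.score(base, pred)` scores the model against its own predictions, so the true labels are used instead):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dummy-coded Deals data (columns are illustrative)
train = pd.DataFrame({
    "Age": [25, 40, 35, 50, 23, 61],
    "Gender = male": [1, 0, 1, 0, 1, 0],
    "Payment Method = cash": [1, 1, 0, 0, 1, 0],
    "Future Customer": ["yes", "no", "yes", "no", "yes", "no"],
})

# "build model" operator: fit a tree on all columns except the label
base = train.iloc[:, :-1]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=20, random_state=0)
clf.fit(base, train["Future Customer"])

# "apply model" operator: add prediction and confidence columns
train["prediction"] = clf.predict(base)
proba = clf.predict_proba(base)
train["confidence(no)"] = proba[:, 0]   # classes_ is sorted, so index 0 is "no"
train["confidence(yes)"] = proba[:, 1]

# Score against the true labels (the original script compared pred to itself)
accuracy = clf.score(base, train["Future Customer"])
```

On this tiny, conflict-free training set the tree separates the classes perfectly, so the training accuracy is 1.0; on the real Deals/Deals-Testset data the numbers will of course differ.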

 

Thank you for your response,

 

Regards,

 

Lionel

 

 

    User: "Telcontar120"
    New Altair Community Member
    Accepted Answer

    That is indeed a lot of questions!  I may not answer them all in complete detail, but here are some thoughts.

    1. I would say that "standard practice" is not to use "Nominal to Numerical" when dealing with nominal predictors in trees, since that conversion is not necessary for the algorithm to work properly, and it can skew the results (as you have seen) by modifying the information gain of the attributes in their raw form. Regardless of your approach to the predictors, I would strongly recommend cross-validation as best practice for model construction, as well as optimization of the key decision tree parameters. I see your initial process below does not use cross-validation or optimization.
    2. It is of course up to you if you prefer a version of the tree built on dummy variables for other reasons, but in my experience most analysts leave the nominal variables alone for trees. If you do create them, the easiest way is to do it in RapidMiner using the method you have shown. Full disclosure: I did not take the time to examine the specific differences between the two methods you provide below or to dive into the Python code. In theory, those should be the same.
    3. In the case of ties in confidence, RapidMiner will assign the prediction to the first label (so for nominals, it will depend on the alphabetical order of the class values).
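Point 1 can be made concrete with a small information-gain calculation (a sketch with made-up data, using the entropy criterion): a raw nominal attribute is evaluated as one multi-way split over all its values, while after dummy coding each indicator column is evaluated as a separate binary split, so the set of candidate splits and their gains changes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(feature, labels):
    """Information gain of splitting `labels` on each value of `feature`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

x = np.array(["a", "a", "b", "b", "c", "c"])          # raw nominal attribute
y = np.array(["yes", "yes", "yes", "no", "no", "no"])  # label

raw_gain = info_gain(x, y)                             # one multi-way split
dummy_gains = {v: info_gain(x == v, y) for v in "abc"} # one binary split per dummy
```

Here the raw attribute has gain 2/3, while no single dummy indicator reaches that value (and the "x = b" dummy has gain 0), so a learner that only sees the dummies ranks and chooses splits differently — consistent with the small differences observed in the question.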