No decision tree created when the "criterion" parameter is set to "gini_index"

Good morning,

I used the "Decision Tree" operator to create a model from a training dataset.

With the "criterion" parameter set to "gini_index", no decision tree appears in the results: the attributes are not taken into account.

When "criterion" is set to "accuracy", "gain_ratio" or "information_gain", the decision trees are created correctly.

My training and scoring datasets are in the attached files.

Here is my process in XML:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Training" width="90" x="112" y="34">
        <parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Training"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
        <parameter key="attribute_name" value="User_ID"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="380" y="34">
        <parameter key="attribute_name" value="eReader_Adoption"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="514" y="34">
        <parameter key="criterion" value="gini_index"/>
        <parameter key="maximal_depth" value="20"/>
        <parameter key="apply_pruning" value="true"/>
        <parameter key="confidence" value="0.25"/>
        <parameter key="apply_prepruning" value="true"/>
        <parameter key="minimal_gain" value="0.1"/>
        <parameter key="minimal_leaf_size" value="2"/>
        <parameter key="minimal_size_for_split" value="4"/>
        <parameter key="number_of_prepruning_alternatives" value="3"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Scoring" width="90" x="112" y="238">
        <parameter key="repository_entry" value="//DataMiningForTheMasses/data/Chapter10DataSet_Scoring"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="313" y="238">
        <parameter key="attribute_name" value="User_ID"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="136">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <connect from_op="Training" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
      <connect from_op="Set Role (3)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Scoring" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Is it a bug?

Can you help me?

Thank you

Lionel

Accepted answers

earmijo

Try unchecking the setting Apply Pre-pruning

Screen Shot 2017-11-12 at 3.25.20 PM.png
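
In the process XML posted in the question, that checkbox corresponds to the apply_prepruning parameter of the Decision Tree operator. A minimal sketch of the changed operator, assuming every other setting stays exactly as in the question:

```xml
<operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="514" y="34">
  <parameter key="criterion" value="gini_index"/>
  <parameter key="maximal_depth" value="20"/>
  <parameter key="apply_pruning" value="true"/>
  <parameter key="confidence" value="0.25"/>
  <!-- Only change: pre-pruning switched off (was "true" in the question) -->
  <parameter key="apply_prepruning" value="false"/>
  <!-- Pre-pruning thresholds left as posted; they should have no effect while pre-pruning is off -->
  <parameter key="minimal_gain" value="0.1"/>
  <parameter key="minimal_leaf_size" value="2"/>
  <parameter key="minimal_size_for_split" value="4"/>
  <parameter key="number_of_prepruning_alternatives" value="3"/>
</operator>
```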

earmijo

Let me add a couple of sentences to Thomas_Ott's answer. I was confused myself when I started using RapidMiner.

You can find a nice and clear explanation of both pruning and pre-pruning here:

Machine Learning: Pruning Decision Trees

You should experiment in your process with all the variations.

Pre-pruning (early stopping): You stop splitting if no significant benefit results from an additional split.

Pruning (post-pruning): You keep splitting until you reach the desired number of levels (depth = the main measure of complexity of the tree) but you try to simplify the tree afterwards.

Neither Pre-pruning nor Pruning: Try it. The tree will grow symmetrically until reaching the desired number of levels (depth).

If processing time is not an issue, there is no reason to ever use the pre-pruning option. In the worst case, you'll end up with the same performance metric, but there is a chance (a real one, as your example illustrates) that you'll do better with post-pruning.
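
To relate these variations to the process from the question: in the Decision Tree operator, post-pruning is controlled by apply_pruning together with confidence, and pre-pruning by apply_prepruning together with the four thresholds listed after it. The fragment below groups those parameters by role. The minimal_gain value of 0.01 is only an illustrative assumption (the question used 0.1). It is one possible way to keep pre-pruning on while making it less strict, and it has not been tested on this dataset.

```xml
<!-- Post-pruning: grow the tree first, then simplify it -->
<parameter key="apply_pruning" value="true"/>
<parameter key="confidence" value="0.25"/>

<!-- Pre-pruning: stop splitting early when a split is not worth it -->
<parameter key="apply_prepruning" value="true"/>
<!-- Hypothetical, less strict threshold; the question used 0.1 -->
<parameter key="minimal_gain" value="0.01"/>
<parameter key="minimal_leaf_size" value="2"/>
<parameter key="minimal_size_for_split" value="4"/>
<parameter key="number_of_prepruning_alternatives" value="3"/>
```

Setting both apply_pruning and apply_prepruning to "false" gives the third variation, where maximal_depth is the remaining explicit limit on the size of the tree.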

All comments

Thomas_Ott

@lionelderkrikor you also have to understand that each criterion has a different way of splitting the dataset into a tree. It might be that gini_index is not a good criterion for splitting your data.

lionelderkrikor

Hi @earmijo,

After unchecking Apply Pre-pruning, a decision tree is created correctly in my case.

I'm a beginner in RapidMiner and data science: can you explain the purpose of checking "Apply Pre-pruning"? In which cases should I check (or uncheck) this option?

In my case, when it is checked (with all related parameters at their default values), the tree has a single node that predicts the majority class of the training set (the label has four classes; see the attached file). So when applied, this model predicts that one class for the entire scoring dataset.

Thank you,

Lionel

decision_tree_apply_prepruning.docx

Thomas_Ott

Pruning and Pre-pruning are ways to reduce the overall complexity of the tree. The more complex the tree gets, the more it can overfit your data. Decision Trees are notorious for overfitting (or being abused to overfit). Pruning helps reduce (not eliminate) the possibility of overfitting.

lionelderkrikor

Hi Thomas,

Thank you for your explanation. I now understand the role of these options better.

Regards,

Lionel

lionelderkrikor

Hi @earmijo

Thank you for your feedback and your resources about decision trees.

If I understand correctly, I must be very careful when using decision trees:

I have to try all combinations [criterion - apply / don't apply pruning - apply / don't apply pre-pruning] and

evaluate the accuracy of the resulting models using a split validation to select the best model.

Regards,

Lionel

MartinLiebig

Hi,

I would be careful with a simple split validation and would rather use an X-Validation with a proper hold-out set.

Best,

Martin
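
For the process in this thread, that could mean wrapping the Decision Tree in a Cross Validation operator fed by "Set Role (3)". The fragment below is a sketch under assumptions, not a tested process: the operator class names, the port names, the Performance operator and the fold count of 10 are written from memory of RapidMiner 7.x and should be checked by adding the operators in Studio. Layout attributes are omitted, and the fragment does not carve out the separate hold-out set.

```xml
<operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" name="Cross Validation">
  <!-- Assumed fold count; adjust to the size of the training data -->
  <parameter key="number_of_folds" value="10"/>
  <!-- Training subprocess: learn a tree on each training fold -->
  <process expanded="true">
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" name="Decision Tree">
      <parameter key="criterion" value="gini_index"/>
      <parameter key="apply_pruning" value="true"/>
      <parameter key="apply_prepruning" value="false"/>
    </operator>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
  </process>
  <!-- Testing subprocess: apply the tree to the held-back fold and measure it -->
  <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" name="Apply Model (CV)">
      <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" name="Performance">
      <parameter key="accuracy" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (CV)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (CV)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (CV)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
  </process>
</operator>
<!-- Wiring in the outer process (replaces the direct link to Decision Tree) -->
<connect from_op="Set Role (3)" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 1"/>
```

Running one such process per combination of criterion and pruning settings gives averaged performance values that can be compared, instead of relying on a single split.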

lionelderkrikor

Hi @mschmitz,

Thank you for your advice: I'll use an X-Validation on the models.

Regards,

Lionel