How do you reduce variables before doing a decision tree?

Matt_Pilk New Altair Community Member
edited November 5 in Community Q&A
Hi!  
Just wanted some help.
1) Do you need to reduce the number of variables before running a decision tree analysis? I currently have 19, and the tree is hard to read because I need to go 12 layers deep to get the accuracy up.

2) If I use Select Attributes to keep only the ones I believe are important after doing some EDA, does this dilute the results, or can I run the decision tree on the original data set?

Any insights from the community would be great.

Thanks,
Matthew


Answers

  • BalazsBarany New Altair Community Member
    Answer ✓
    Hi!

    Decision-tree-based methods are all about selecting relevant attributes. If you remove attributes beforehand and the tree changes, those attributes were relevant and your tree probably got worse. If the tree doesn't change, the removal did find irrelevant attributes, but the decision tree would have ignored them anyway.

    It is a good idea to assess your attributes and check them for harmful things such as "future" knowledge leaking into the model, data that are hard to obtain, or attributes with many missing values. You can remove these manually. But you shouldn't remove attributes on the basis of "I don't think these are relevant" before applying any method that selects or weights attributes itself. That would be "part human, part machine learning", and it is hard to get better results from this process than from an algorithm written for the task.

    If your decision tree is too hard to interpret AND interpretability is a more important goal than accuracy, it's better to set the pruning parameters to stricter values. That gives you a smaller, easier-to-understand tree without sacrificing relevant attributes before the algorithm is applied.

    Regards,
    Balázs
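    The two points above can be sketched in code. This is a scikit-learn illustration (an assumption on my part; the thread itself is about RapidMiner's Decision Tree operator, but the idea carries over): the fitted tree weights attributes itself, and stricter pruning parameters yield a smaller tree without removing relevant attributes up front.

    ```python
    # Sketch with scikit-learn as a stand-in for RapidMiner's Decision
    # Tree operator: let the tree select relevant attributes itself,
    # then prune for interpretability.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data: 19 attributes, only 5 actually informative.
    X, y = make_classification(n_samples=500, n_features=19,
                               n_informative=5, random_state=0)

    # Unpruned tree: grows deep and is hard to read.
    deep = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Stricter "pruning" parameters: a smaller, easier-to-read tree.
    small = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                   random_state=0).fit(X, y)

    # The fitted tree reports which attributes it found relevant;
    # attributes with (near-)zero importance would have been ignored
    # anyway, so removing them by hand changes nothing.
    relevant = [i for i, w in enumerate(deep.feature_importances_)
                if w > 0.01]
    print("depth of unpruned tree:", deep.get_depth())
    print("depth of pruned tree:  ", small.get_depth())
    print("attributes the tree actually used:", relevant)
    ```

    Note how the pruned tree trades a little training accuracy for a much shallower, readable structure, while the importance list shows which of the 19 attributes mattered at all.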
  • Matt_Pilk New Altair Community Member
    Thanks Balázs. When building the decision tree, do you keep extending it until you reach an accuracy you consider acceptable and explainable, or do you go for the highest accuracy even if the tree ends up 15-20 levels deep?
  • BalazsBarany New Altair Community Member
    Answer ✓
    Hi!

    Whether interpretability of the decision tree is the most important factor depends on the use case.

    Usually it isn't, and I use parameter optimization to get the best decision tree, or a model from a different learning algorithm. (A decision tree often isn't the best model.)

    There's an example building block in the Community Samples repository:


    Regards,
    Balázs
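  The parameter-optimization idea can be sketched as follows; again scikit-learn is used here as an illustrative stand-in (the thread refers to RapidMiner's optimization building block, whose actual operators and parameter names differ). Cross-validated search picks the pruning settings instead of extending the tree by hand:

  ```python
  # Sketch of parameter optimization for a decision tree using
  # scikit-learn's GridSearchCV (illustrative only).
  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=500, n_features=19,
                             n_informative=5, random_state=0)

  # Search over pruning parameters instead of hand-tuning the depth:
  # 5-fold cross-validation picks the combination with the best
  # estimated accuracy on unseen data.
  grid = GridSearchCV(
      DecisionTreeClassifier(random_state=0),
      param_grid={"max_depth": [3, 5, 8, 12, None],
                  "min_samples_leaf": [1, 10, 25]},
      cv=5, scoring="accuracy")
  grid.fit(X, y)

  print("best parameters:", grid.best_params_)
  print("cross-validated accuracy: %.3f" % grid.best_score_)
  ```

  Because the score is cross-validated, a 15-20 level tree only "wins" if it genuinely generalizes better; otherwise the search settles on a shallower, more explainable tree by itself.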