Modelling a decision tree with very large data?

eldenoso
eldenoso New Altair Community Member
edited November 2024 in Community Q&A

Hello everyone,

 

Currently I am trying to create decision tree models with large data. The problem is that the decision tree either gets too large (wide) or too small, so accuracy is low and connections can't be identified. I have already tried different things, such as discretizing the numerical attributes, but it doesn't work well. Most of the attributes are nominal; only one is numerical. Unlike the Titanic example, I don't have a "yes/no" label. Could this be causing the problem?
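For illustration, here is roughly the kind of discretization I tried, written as a small pandas sketch (my real process is built in RapidMiner, so the column name and cut points below are only placeholders):

```python
import pandas as pd

# Hypothetical numerical attribute standing in for the single numerical column
df = pd.DataFrame({"age": [12, 25, 37, 48, 63, 71]})

# Discretize it into labelled bins (cut points are just examples)
df["age_binned"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young", "middle", "senior"],
)
print(df)
```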

Thank you for your help! :)

Philipp

Answers

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Hey @eldenoso, I see that you were trying to reduce the features by discretizing them. Did you also try adjusting the pruning and pre-pruning parameters?

  • eldenoso
    eldenoso New Altair Community Member

    Thank you for your reply, Thomas!

    Yes, I played with all three parameters (confidence, minimal leaf size, minimal size for split), but I can't come up with something usable or "easy to read" like the Titanic example.


  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    Did you change your Tree Depth parameter? The default is 20, which is pretty big; I usually set it to 5.

     

    Both the minimal leaf size and the minimal size for split are pretty important pre-pruning parameters. I would try bumping those values up to something larger than what you have now.
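    For reference, here is roughly how those pre-pruning settings map onto scikit-learn (this thread uses RapidMiner's Decision Tree operator, so the parameter names below are only rough equivalents and the values are placeholders to experiment with):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Toy data standing in for the real attributes
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

    # Rough equivalents of maximal depth, minimal leaf size and minimal size for split
    tree = DecisionTreeClassifier(
        max_depth=5,            # "maximal depth": the default of 20 is often too deep
        min_samples_leaf=50,    # "minimal leaf size"
        min_samples_split=100,  # "minimal size for split"
        random_state=42,
    )
    print(cross_val_score(tree, X, y, cv=10).mean())
    ```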

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    A few additional thoughts:

    1. Minimal gain for split is a crucial pre-pruning parameter in my experience, so you may want to try a wider range for that and see how it affects your tree.
    2. If you have nominal attributes with a lot of distinct values, consider consolidating or aggregating those values, since too many individual values can lead to low counts per value (see the sketch after this list).
    3. If a flat decision tree isn't working well, you might consider an ensemble model built on trees, such as Random Forest or Gradient Boosted Trees.
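    As a rough sketch of what I mean by consolidating a nominal attribute (point 2), here is the idea in pandas; the column name and frequency threshold are just placeholders, and in RapidMiner you would normally do this with mapping or aggregation operators rather than code:

    ```python
    import pandas as pd

    # Hypothetical nominal attribute with many distinct values
    df = pd.DataFrame({"city": ["Berlin", "Hamburg", "Berlin", "Ulm", "Hof",
                                "Berlin", "Hamburg"]})

    # Keep values that occur often enough; merge all rare values into "Other"
    counts = df["city"].value_counts()
    frequent = counts[counts >= 2].index
    df["city_grouped"] = df["city"].where(df["city"].isin(frequent), other="Other")
    print(df)
    ```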


  • MartinLiebig
    MartinLiebig
    Altair Employee

    Hi,

     

    I support Brian's arguments. Decision trees are a great tool to start with while still keeping an understanding of your model, but I think you are running into the limitations of what you can do with a single tree. Just think about how a tree of depth 5 can cut up your feature space: it cannot produce a very detailed classification.

     

    I would recommend trying a random forest and later a GBT. You lose interpretability but gain prediction performance.
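    As a rough reference, both ensembles look like this in scikit-learn (the data and parameter values are placeholders; in RapidMiner you would use the Random Forest and Gradient Boosted Trees operators instead):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy data standing in for the real attributes
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

    # Random forest: many deep trees averaged, usually the easiest next step after a single tree
    rf = RandomForestClassifier(n_estimators=200, random_state=42)

    # Gradient boosted trees: shallow trees fitted sequentially, often stronger but need more tuning
    gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)

    for name, model in [("random forest", rf), ("gradient boosted trees", gbt)]:
        print(name, cross_val_score(model, X, y, cv=5).mean())
    ```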

     

    Best,

    Martin

  • eldenoso
    eldenoso New Altair Community Member

    Thank you all for your help! 

    I integrated all of your suggestions into my process. To make the tree more "readable" I changed the pre-pruning parameters (minimal gain 0.01) and also raised the confidence to 0.25. Moreover, since my label consisted of nearly twenty different names, I grouped them into two classes, which I think had the biggest impact on my tree. On the positive side, the accuracy did not decrease; on the contrary, it increased (82 % in cross-validation).
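    For anyone reading along, the label grouping step looked roughly like this (a pandas sketch with made-up names; my real label has nearly twenty values):

    ```python
    import pandas as pd

    # Hypothetical many-valued label collapsed into two broader groups
    labels = pd.Series(["TypeA", "TypeB", "TypeC", "TypeD", "TypeA", "TypeC"])
    group_map = {
        "TypeA": "group_1", "TypeB": "group_1",  # assumed grouping
        "TypeC": "group_2", "TypeD": "group_2",
    }
    binary_label = labels.map(group_map)
    print(binary_label.value_counts())
    ```

    The two-class label then replaces the original one before training the decision tree.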

    So, to put it briefly: I have a tree I can work with! :)

    Thank you again for your answers! 

    Regards,

    Philipp