The Decision Tree gave an impossible result

MarkusW
MarkusW New Altair Community Member
edited November 5 in Community Q&A



I just trained a model using a Decision Tree that reached an F-score of 99.7%.
That sounds good until you hear that Naive Bayes only got 66.4%.
The highest score I found on that dataset was 98.2%, using deep learning.
The highest CREDIBLE score I found on that dataset was 78.5%.

The design is based off of this video:


All I did was replace the Naive Bayes operator in the Cross Validation with the Decision Tree operator.
Even with 10-fold cross-validation I should still not get much more than 70%...

The immediate cause of the high score is that, for some reason, there is a strong correlation between the label and the id. However, I do not know how to limit which columns the algorithm uses.
The question is: what did I do wrong, and how do I make it right?

Best Answer

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Answer ✓
    Often it's just because the two sets for the two classes got appended. So the first half of the data set is true, the second half is false?

    Otherwise: ids often correlate with dates, which correlate with the label.

    What you want to do is either use Select Attributes to remove the id, or use Set Role to set the role of id to id.

    Best,
    Martin
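
To see why an id column alone can produce a near-perfect score, here is a minimal Python sketch (not RapidMiner code). It assumes a hypothetical data set built by appending all rows of one class after all rows of the other, with the id being just the line number. The function name `best_id_threshold` and the toy data are illustrative assumptions, not taken from the original process.

```python
# Toy data (assumed layout): 50 "true" rows appended before 50 "false" rows.
labels = [True] * 50 + [False] * 50
ids = list(range(1, len(labels) + 1))  # the id is just the line number


def best_id_threshold(ids, labels):
    """Find the single split 'id <= t -> True' with the highest accuracy."""
    best_t, best_acc = None, 0.0
    for t in ids:
        correct = sum((i <= t) == y for i, y in zip(ids, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


threshold, accuracy = best_id_threshold(ids, labels)
print(f"split at id <= {threshold}: accuracy {accuracy:.3f}")
# prints: split at id <= 50: accuracy 1.000
```

A single split on the id is enough here, which is the kind of rule a decision tree can pick up. Such a rule says nothing about new rows, so the score is not a credible estimate of real performance.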

Answers

  • MartinLiebig
    MartinLiebig
    Altair Employee
    Did you look at the tree? What is it doing?

    BR,
    Martin
  • MarkusW
    MarkusW New Altair Community Member
    I probably should have, before starting it up with random forest, to see if the problem persists...
  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi,

    Look at the decision tree. Maybe you left an attribute in the data that correlates very strongly with the label but wouldn't be available for future data.

    Is the tree complex? Are the decisions obvious? 

    You can put breakpoints on various parts of the process (I'd try with the Decision Tree and Performance) to look at the different validation steps. 

    Regards,
    Balázs
  • MarkusW
    MarkusW New Altair Community Member
    I can say with certainty that the only thing that correlates remotely as strongly with the correct label is the label itself.
    I believe that if I had made the mistake of letting the program use the label column to predict the label column, Naive Bayes would also have had an incredibly high F-score.
    The Decision Tree had very few settings I could actually change. My best guess is that I should have either used a "different" Decision Tree operator, if there are multiple, or that the 10-fold cross-validation somehow doesn't work the same way depending on the learning algorithm, and I should have changed settings there.
  • BalazsBarany
    BalazsBarany New Altair Community Member
    Hi!

    If this happens again, look at the stepwise execution results. If you get a very simple tree, or unbelievable performance results in different executions, the breakpoints help you identify the problem.

    Sometimes multiple attributes together correlate with the result but not individually. Decision Tree might be better at catching some of these situations.

    Regards,
    Balázs
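
The point about attributes that only matter in combination can be illustrated with a small Python sketch. It uses an assumed toy XOR data set (not the data from this thread): each attribute on its own has zero correlation with the label, yet a two-level rule over both attributes predicts it exactly. The names `pearson`, `rows`, and `tree_accuracy` are hypothetical.

```python
from statistics import mean

# Toy XOR data (assumed for illustration): label = a XOR b.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 25


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


a = [r[0] for r in rows]
b = [r[1] for r in rows]
y = [r[2] for r in rows]

corr_a = pearson(a, y)  # correlation of a alone with the label
corr_b = pearson(b, y)  # correlation of b alone with the label

# A two-level rule shaped like a depth-2 tree: look at a, then at b.
tree_accuracy = mean(int((ai != bi) == bool(yi)) for ai, bi, yi in zip(a, b, y))

print(corr_a, corr_b, tree_accuracy)
```

This only shows that a tree-shaped rule can represent such an interaction. Whether a particular tree learner actually finds it depends on its splitting criterion and pruning settings.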
  • MarkusW
    MarkusW New Altair Community Member
    OK, even though the correlation is not supposed to be nearly that strong, it's still unwanted that most of the factors in the tree appear to be the id of the dataset.
    My guess is that if I forbid it from doing that, I'd get much better results.
    I assume I do that with the "Set Role" operator, but I don't know how.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    "I believe that if I had made the mistake of letting the program use the label column to predict the label column, Naive Bayes would also have had an incredibly high F-score."

    That's not true. Naive Bayes in particular can be confused very quickly by the other 'noise' attributes. This is not true for a Decision Tree.


  • MarkusW
    MarkusW New Altair Community Member
    Yes, apparently there is a weirdly strong correlation between "ID" (basically just the line number) and the label. I just need to find out how to exclude it from the columns that the algorithm is allowed to use.
    Help is welcome.
  • BalazsBarany
    BalazsBarany New Altair Community Member
    You can set the role of this column to "id" using Set Role. If you already have an attribute with the role id, just enter a different role name (e.g. ItemID). Everything marked with a special role, custom or built-in, is excluded from modeling.
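
The exclusion rule described above can be mimicked in a few lines of Python. This is only a sketch of the idea, not RapidMiner's API: the column names, the `roles` mapping, and the helper `regular_attributes` are all hypothetical, and "regular" is used here to mark ordinary input attributes.

```python
# Hypothetical column-to-role mapping; "regular" marks ordinary attributes.
roles = {
    "id": "id",               # built-in special role
    "row_number": "ItemID",   # custom special role, name chosen freely
    "text_length": "regular",
    "word_count": "regular",
    "label": "label",         # the target, also a special role
}


def regular_attributes(roles):
    """Return only the columns a learner would be allowed to use as inputs."""
    return [name for name, role in roles.items() if role == "regular"]


model_inputs = regular_attributes(roles)
print(model_inputs)  # ['text_length', 'word_count']
```

Both the id and the line-number column drop out of the inputs, so a model trained on `model_inputs` cannot split on them.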