The Decision Tree gave an impossible result

MarkusW · October 2021

I just trained a model using a Decision Tree that reached an F-score of 99.7%.

That sounds good until you hear that Naive Bayes only got 66.4%.

The highest score I found on that dataset was 98.2%, using deep learning.

The highest CREDIBLE score I found on that dataset was 78.5%.

The design is based on this video:

https://academy.rapidminer.com/learn/video/automatic-classification-of-documents

Image: https://us.v-cdn.net/6030995/uploads/editor/6a/w19m70174a81.png

Image: https://us.v-cdn.net/6030995/uploads/editor/yv/ndf1hluukfa3.png

All I did was replace the Naive Bayes operator inside the Cross Validation with the Decision Tree operator.

Even with 10-fold cross-validation I should still not get much more than 70%...

The immediate cause of the high score is that, for some reason, there is a strong correlation between the label and the id. However, I do not know how to limit which columns the algorithm uses.

The question is: what did I do wrong, and how do I make it right?

MartinLiebig · October 2021

Often it's just because the two sets for the two classes got appended. So the first half of the data set is true and the second half is false?

Otherwise: often ids correlate with dates, which correlate with the label.

What you want to do is either use Select Attributes to remove the id, or use Set Role to set the role of id to id.
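
To illustrate the first point, here is a minimal Python sketch (not RapidMiner code). It assumes a made-up dataset of 100 rows in which the 50 rows of one class were appended before the 50 rows of the other, and the id is simply the row number; `rows` and `best_id_threshold` are hypothetical names. A single split on the id then separates the classes perfectly, and dropping the column, as Select Attributes would, takes that shortcut away.

```python
# Hypothetical toy data: 50 "true" rows appended before 50 "false" rows,
# with the id being nothing more than the row number.
rows = [{"id": i, "label": i < 50} for i in range(100)]


def best_id_threshold(rows):
    """Return (threshold, accuracy) of the best rule 'id < threshold => True'."""
    best = (None, 0.0)
    for threshold in range(len(rows) + 1):
        correct = sum((r["id"] < threshold) == r["label"] for r in rows)
        accuracy = correct / len(rows)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


threshold, accuracy = best_id_threshold(rows)
print(threshold, accuracy)  # 50 1.0

# The "remove the id" route: drop the column before any learner sees it.
without_id = [{k: v for k, v in r.items() if k != "id"} for r in rows]
```

A real decision tree searches for this kind of split over every attribute it is given, which is why an id that happens to sort the classes can end up dominating the tree.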

Best,

Martin

MartinLiebig · October 2021

Did you look at the tree? What is it doing?

BR,

Martin

MarkusW · October 2021

I probably should have, before starting it up again with Random Forest to see whether the problem persists...

BalazsBaranyRM · October 2021

Hi,

Look at the decision tree. Maybe you left an attribute in the data that correlates very strongly with the label but wouldn't be available for future data.

Is the tree complex? Are the decisions obvious?

You can put breakpoints on various parts of the process (I'd try with the Decision Tree and Performance) to look at the different validation steps.

Regards,
Balázs

MarkusW · October 2021

I can say with certainty that the only thing that correlates remotely as strongly with the correct label is the label itself.

I believe that if I had made the mistake of the program using the label column to predict the label column, it would have resulted in Naive Bayes also having an incredibly high F-score.

The Decision Tree had very few settings I could actually change. My best guess is that I should have either used a "different" Decision Tree operator, if there are multiple, or that the 10-fold cross-validation somehow doesn't work the same way depending on the learning algorithm, and I should have changed settings there.
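
On the cross-validation guess, here is a minimal Python sketch (not RapidMiner code) of why cross-validation alone does not catch this kind of problem. It assumes a made-up dataset in which the label depends only on the row number, and a hypothetical one-split learner `fit_threshold`. In the sketch the folds are built without any knowledge of the learner, and the leaking attribute is present in both the training and the test part of every fold, so the per-fold scores stay high.

```python
import random

# Hypothetical toy data: (id, label), label fully determined by the row number.
rows = [(i, i < 50) for i in range(100)]


def fit_threshold(train):
    """Learn the 'id < t => True' rule with the fewest training errors."""
    candidates = range(0, 101)  # every possible cut point for ids 0..99
    return max(candidates, key=lambda t: sum((i < t) == y for i, y in train))


def cross_validate(rows, k=10, seed=0):
    """Shuffled k-fold cross-validation; returns the accuracy of each fold."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]
    accuracies = []
    for f in range(k):
        test = folds[f]
        train = [r for g in range(k) if g != f for r in folds[g]]
        t = fit_threshold(train)
        accuracies.append(sum((i < t) == y for i, y in test) / len(test))
    return accuracies


accuracies = cross_validate(rows)
print(accuracies)
```

Only the fold whose test part contains the row right at the class boundary can lose any accuracy here; every other fold is predicted perfectly from the id alone.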

BalazsBaranyRM · October 2021

Hi!

If this happens again, look at the stepwise execution results. If you get a very simple tree, or unbelievable performance results in different executions, the breakpoints help you identify the problem.

Sometimes multiple attributes together correlate with the result but not individually. Decision Tree might be better at catching some of these situations.
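
A small Python sketch of that situation, using the classic XOR pattern as a made-up example: the label is determined by two attributes together, while each attribute on its own is no better than a coin flip. The names `rows` and `best_rule_accuracy` are hypothetical, and the lookup rule is only a stand-in for what nested tree splits can express, not a decision tree implementation.

```python
from itertools import product

# Hypothetical toy data: the label is the XOR of two binary attributes.
rows = [{"a": a, "b": b, "label": a != b} for a, b in product([0, 1], repeat=2)]


def best_rule_accuracy(rows, attributes):
    """Accuracy of the best lookup rule that sees only `attributes`:
    for each combination of their values, predict the majority label."""
    groups = {}
    for r in rows:
        key = tuple(r[name] for name in attributes)
        groups.setdefault(key, []).append(r["label"])
    correct = sum(max(labels.count(True), labels.count(False))
                  for labels in groups.values())
    return correct / len(rows)


acc_a = best_rule_accuracy(rows, ["a"])        # 0.5
acc_b = best_rule_accuracy(rows, ["b"])        # 0.5
acc_ab = best_rule_accuracy(rows, ["a", "b"])  # 1.0
print(acc_a, acc_b, acc_ab)
```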

Regards,
Balázs

MarkusW · October 2021

OK, even though the correlation isn't supposed to be nearly that strong, it's still unwanted that most of the factors in the tree appear to be the id of the dataset.

My guess is that if I forbid it from doing that, I'd get much better results.

I assume I do that with the "Set Role" operator, but I don't know how.

MartinLiebig · October 2021

MarkusW wrote: "I believe that if I had made the mistake of the program using the label column to predict the label column, it would have resulted in Naive Bayes also having an incredibly high F-score."

That's not true. A Naive Bayes algorithm in particular can be confused very quickly by the other 'noise' attributes. A Decision Tree is not affected in the same way.

MarkusW · October 2021

Yes, apparently there is a weirdly strong correlation between "ID" (basically just the line number) and the label. I just need to find out how to exclude it from the columns that the algorithm is allowed to use.

Help is welcome.

BalazsBaranyRM · October 2021

You can set the role of this column to "id" using Set Role. If you already have an attribute with the role id, just enter a second role name (e.g. ItemID). Everything marked with a special role, custom or built-in, is excluded from modeling.
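
The idea behind roles can be sketched in a few lines of Python. This only illustrates the principle, it is not how RapidMiner implements it; the `roles` mapping, the attribute names, and the helper `model_inputs` are all hypothetical.

```python
# Hypothetical role assignment: attribute name -> role.
# "regular" attributes are model inputs; any other role counts as special.
roles = {
    "id": "id",
    "text_length": "regular",
    "word_count": "regular",
    "label": "label",
}


def model_inputs(row, roles):
    """Keep only the attributes whose role is 'regular'."""
    return {name: value for name, value in row.items()
            if roles.get(name, "regular") == "regular"}


row = {"id": 17, "text_length": 240, "word_count": 41, "label": True}
inputs = model_inputs(row, roles)
print(inputs)  # {'text_length': 240, 'word_count': 41}
```

The id stays in the data, so each row can still be traced back later, but the learner never gets to split on it.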
