Criterion for overfitting evaluation

Hung_Bui_221 · November 2022

Hello everyone, and have a nice day. I am having some trouble with overfitting. I have been searching the RM Community and other websites for information. They say that if the accuracy is greater than 90%, I am most probably facing overfitting. My case is below:

My datasets look like this:

Image: https://us.v-cdn.net/6030995/uploads/editor/zo/lshqu0yrqoqv.png

Then I created a process using a classification model (decision tree), with bank-additional-full.csv as training data and bank-additional.csv as test data. After running it, the accuracy is about 97% (and the correlation is about 79%).

I think this is overfitting. Is that correct? If yes, how can I fix this problem? And is accuracy the only criterion for evaluating overfitting? Please help me. Thank you.

Marco_Barradas · November 2022

Hi @Hung_Bui_221

These videos might help clarify things. You might have an accuracy of 97%, but what's the recall on the class that you are trying to predict?

https://academy.rapidminer.com/learn/article/overfitting-outliers

Introduction to Performance Measurement
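
To make the accuracy versus recall question concrete, here is a minimal Python sketch. The counts are hypothetical (1,000 customers, 30 of whom said "yes") and are not taken from the bank dataset in this thread; the function names are made up for illustration. It shows how a model that never predicts "yes" can still reach 97% accuracy:

```python
def accuracy(actual, predicted):
    """Share of all examples whose predicted label matches the actual one."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def recall(actual, predicted, positive):
    """Share of the actual positives that the model found."""
    actual_positives = sum(1 for a in actual if a == positive)
    if actual_positives == 0:
        return 0.0  # recall is undefined without positives; report 0.0 here
    found = sum(
        1 for a, p in zip(actual, predicted) if a == positive and p == positive
    )
    return found / actual_positives


# Hypothetical, heavily imbalanced label column: 30 "yes" and 970 "no".
actual = ["yes"] * 30 + ["no"] * 970
# A useless model that answers "no" for every customer.
always_no = ["no"] * 1000

print(accuracy(actual, always_no))       # 0.97
print(recall(actual, always_no, "yes"))  # 0.0
```

If the recall on the class you care about is close to zero, the high accuracy mostly reflects class imbalance, which is a different problem from overfitting.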

BalazsBaranyRM · November 2022

Hi!

Just having a high accuracy doesn't mean that you have overfitting. You could also have a good model.

Look at the decision tree and try different pruning parameter settings to control the possibility of overfitting. You'll be able to see if the tree is getting very complex and making nonsensical decisions (like "if first name = Peter then Label") or not.

An overfitted model doesn't work well on new data. Therefore, you just need to make sure that you validate the model correctly. See these videos in the Academy:
https://academy.rapidminer.com/learn/video/validating-a-model
https://academy.rapidminer.com/learn/video/optimization-of-the-model-parameters
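
The idea of comparing performance on training data and on unseen data can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the data is synthetic (one numeric attribute, label "x > 0.5" with 15% of the labels flipped), the tiny tree builder is written for this example and is not RapidMiner's Decision Tree operator, and `max_depth` merely stands in for its depth and pruning settings.

```python
import random


def make_data(n, rng, noise=0.15):
    """Synthetic rows (x, label): label is x > 0.5, flipped with probability noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows


def majority(rows):
    """Most frequent label in rows (ties count as True)."""
    positives = sum(1 for _, y in rows if y)
    return positives * 2 >= len(rows)


def build_tree(rows, max_depth):
    """Grow a tree on one numeric attribute; max_depth=None means unlimited."""
    if max_depth == 0 or len({y for _, y in rows}) == 1:
        return majority(rows)  # leaf
    rows = sorted(rows)
    total_pos = sum(1 for _, y in rows if y)
    best = None  # (errors, split index)
    left_pos = 0
    for i in range(1, len(rows)):
        left_pos += rows[i - 1][1]  # True counts as 1
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between identical values
        right_pos = total_pos - left_pos
        # Errors if each side predicts its own majority label.
        errors = min(left_pos, i - left_pos) + min(
            right_pos, len(rows) - i - right_pos
        )
        if best is None or errors < best[0]:
            best = (errors, i)
    if best is None:
        return majority(rows)
    i = best[1]
    deeper = None if max_depth is None else max_depth - 1
    # Node: rows with x <= threshold go left, the rest go right.
    return (rows[i - 1][0], build_tree(rows[:i], deeper), build_tree(rows[i:], deeper))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def accuracy(tree, rows):
    return sum(1 for x, y in rows if predict(tree, x) == y) / len(rows)


rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)
for depth in (1, 3, None):
    tree = build_tree(train, depth)
    print(depth, accuracy(tree, train), accuracy(tree, test))
```

The unlimited tree memorizes the noisy training labels, so its training accuracy is perfect while its accuracy on the held-out rows is not. A large gap between the two numbers is the signal to look for, rather than a high accuracy by itself.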

Regards,
Balázs

Hung_Bui_221 · November 2022

Hi @BalazsBarany @MarcoBarradas. Thank you for helping me. Now I understand the overfitting issue better. Here is my result after running the process:

Image: https://us.v-cdn.net/6030995/uploads/editor/f4/36pbtara59en.png

Image: https://us.v-cdn.net/6030995/uploads/editor/gq/9yprebc4hbtu.png

Besides, I have 2 more questions:

1. In Optimize, I use Bagging (with a Decision Tree inside) because, as far as I know, this is also a way to reduce overfitting. Is that correct? After running it, I obtained 10 trees. How can I know which tree should be chosen?

2. As far as I know, highly correlated attributes should be removed. So I used Weight by Correlation for the numerical and binominal attributes and then removed the ones with a correlation greater than 0.95. But what about polynominal attributes? At first I used Weight by Information Gain and Select by Weight for them. Then I got confused and changed to a Correlation Matrix for all attributes, as in the image above. In this case, what should I do?

Sorry for the long post, and thank you again for looking at my questions.

BalazsBaranyRM · November 2022

Hi!

Bagging and other ensemble methods can help reduce overfitting and make models more robust. When you obtain 10 trees in a bagging model, those 10 trees together are the model, so you don't need to choose one of them. The ensemble is probably as good as or better than just one tree.
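
As a rough Python sketch of why all the trees are kept: each model is trained on its own bootstrap sample, and a prediction combines the answers of all of them. The assumptions here are mine: the data is synthetic, each "tree" is reduced to a single threshold (a decision stump), and the combination rule is a plain majority vote, which may differ from how RapidMiner's Bagging operator combines its inner models.

```python
import random


def make_data(n, rng, noise=0.15):
    """Synthetic rows (x, label): label is x > 0.5, flipped with probability noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows


def fit_stump(rows):
    """Pick the threshold t that makes the rule 'x > t' misclassify the fewest rows."""
    best_threshold, best_errors = None, None
    for threshold, _ in rows:
        errors = sum(1 for x, y in rows if (x > threshold) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagging(rows, n_models, rng):
    """Train n_models stumps, each on a bootstrap sample of the rows."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement
        models.append(fit_stump(sample))
    return models


def predict_bagging(models, x):
    """Majority vote over all stumps; a tie counts as False."""
    yes_votes = sum(1 for threshold in models if x > threshold)
    return yes_votes * 2 > len(models)


rng = random.Random(0)
train = make_data(200, rng)
models = fit_bagging(train, 10, rng)

print(models)                         # ten slightly different thresholds
print(predict_bagging(models, 0.99))
print(predict_bagging(models, 0.01))
```

Each stump lands on a slightly different threshold because it saw a different sample. The vote smooths out those differences, which is why picking a single member would throw away the benefit.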

With tree-based methods, correlation is not that big of a problem. Once one attribute is selected for a split, the attributes correlated with it don't really matter.

If you suspect that correlation among polynominal attributes might make your model worse, you should validate that assumption. A good way to test this is to use Nominal to Numerical, which re-codes the nominal attribute values into new 0/1 attributes. Then you can apply similar correlation-based filters.
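
The re-coding and filtering step can be sketched in plain Python as follows. Everything in it is hypothetical: the attribute names and values are invented for the example, the helper functions are not RapidMiner operators, and the 0.95 threshold is simply the one mentioned earlier in this thread.

```python
from math import sqrt


def one_hot(name, values):
    """Re-code one nominal column into 0/1 columns named 'name=value'."""
    return {
        f"{name}={v}": [1 if x == v else 0 for x in values]
        for v in sorted(set(values))
    }


def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_a * var_b)


def highly_correlated_pairs(columns, threshold=0.95):
    """List column pairs from different attributes with |r| above threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Dummy columns of the same attribute are related by construction.
            if first.split("=")[0] == second.split("=")[0]:
                continue
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                pairs.append((first, second, r))
    return pairs


# Hypothetical nominal attributes; "device" duplicates the information in "contact".
contact = ["cellular", "telephone", "cellular", "cellular", "telephone", "telephone"]
device = ["mobile", "landline", "mobile", "mobile", "landline", "landline"]
marital = ["single", "married", "married", "single", "married", "single"]

columns = {}
columns.update(one_hot("contact", contact))
columns.update(one_hot("device", device))
columns.update(one_hot("marital", marital))

pairs = highly_correlated_pairs(columns)
for first, second, r in pairs:
    print(first, second, round(r, 2))
```

Only the contact and device columns are flagged here, so one of those two attributes could be dropped. Whether dropping it actually helps the model is still something to check with proper validation.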

Regards,
Balázs
