Performance Measures for Imbalanced Data

ozgeozyazar
New Altair Community Member
Hi all!
My question is not directly about the program, but I know this community has many experienced data miners, so I believe I can get the right answer here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to build a decision tree for detecting disease risk factors. I have read an article that says "Any performance metric that uses values from both columns will be inherently sensitive to class skews." I take this to mean that if I have imbalanced data I should not use those metrics. Could you please confirm my understanding?
Answers
-
Hello @ozgeozyazar
Actually, it is not that you shouldn't use these measures, but they vary when there is class imbalance and can be misleading; accuracy is the classic example.
-
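A minimal sketch (hypothetical numbers, plain Python, nothing RapidMiner-specific) of why accuracy misleads on imbalanced data: a model that always predicts the majority class looks excellent on accuracy while finding no positives at all.

```python
# Hypothetical imbalanced dataset: 95 healthy cases, 5 diseased cases.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # trivial model: always predicts "healthy"

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive (diseased) class: fraction of positives found.
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 5

print(accuracy)  # 0.95 -- looks excellent
print(recall)    # 0.0  -- yet not a single diseased case is found
```

This is exactly the kind of skew sensitivity the quoted article warns about.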
@varunm1 I would highly recommend having a look at AUPRC.
-
Hi @ozgeozyazar,
To complete @mschmitz's post, here is a Kaggle article which advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric when the dataset is very imbalanced:
https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve
If you want to use the AUPRC performance operator in your RapidMiner process, you have to install the free Operator Toolbox extension.
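For intuition about what that operator reports, here is a plain-Python sketch of one common way to compute AUPRC, the average-precision formulation (this is an illustration of the metric, not the Operator Toolbox implementation; the data below is made up):

```python
def average_precision(y_true, scores):
    """AUPRC via average precision: the mean of the precision values
    at each rank where a true positive is encountered."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank   # precision at this recall point
    return total / n_pos

# Hypothetical model scores: two positives ranked high, one ranked last.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
print(round(average_precision(y, s), 3))  # 0.767
```

Unlike ROC-AUC, this number is pulled down hard by that one poorly ranked positive, which is why AUPRC is more informative when positives are rare.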
Regards,
Lionel
-
Thanks, @lionelderkrikor, for sharing this.
-
Hi @varunm1 and @lionelderkrikor, now that I am on my working PC: have a look at this paper: https://www.biostat.wisc.edu/~page/rocpr.pdf . I discovered it while working with Sven. It proves that a curve which dominates in AUPRC also dominates in AUC, but not the other way around. Besides the usual problems I talk about with correlation to business value, I would thus prefer AUPRC if I know the class balance.
Best,
Martin
-
Interesting reading. Just remember that which metric is "better" (AUC vs. AUPRC) is very much a function of business needs, since they optimize different things. As @mschmitz has noted in the past, if you can actually assign a cost to your different classification outcomes (TP, TN, FP, FN), then the best approach is to use the Performance Costs operator and optimize directly for that. These other curves are simply approaches based on other useful metrics. The other thing to be aware of is that AUPRC is probably less well known, so you might have some difficulty explaining it even to other data scientists, never mind business users.
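To make the cost-based idea concrete, here is a plain-Python sketch of what a cost-sensitive performance measure computes (the cost values below are invented for illustration; in practice they would come from the business problem, e.g. a missed disease case costing far more than an unnecessary follow-up test):

```python
# Hypothetical cost matrix: correct predictions are free, a false
# positive costs 1 unit, a missed positive (false negative) costs 10.
COSTS = {"TP": 0.0, "TN": 0.0, "FP": 1.0, "FN": 10.0}

def total_cost(y_true, y_pred, costs=COSTS):
    """Sum the misclassification costs over all predictions."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            cost += costs["TP"]
        elif t == 0 and p == 0:
            cost += costs["TN"]
        elif t == 0 and p == 1:
            cost += costs["FP"]
        else:
            cost += costs["FN"]
    return cost

# On a 95/5 imbalanced set, the "always negative" model that wins on
# accuracy is no longer automatically the winner once costs are assigned:
y_true = [0] * 95 + [1] * 5
print(total_cost(y_true, [0] * 100))  # 50.0: five missed positives
print(total_cost(y_true, [1] * 100))  # 95.0: 95 false alarms
```

With this framing, model selection becomes "minimize expected cost" rather than "maximize a curve area", which is exactly what a cost-sensitive performance operator lets you optimize directly.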