Performance Measures for Imbalanced Data

ozgeozyazar
New Altair Community Member
Hi all!
My question is not directly about the program, but I know this community has many experienced data miners, so I believe I can get the right answer here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to build a decision tree for detecting disease risk factors. I have read an article that says "Any performance metric that uses values from both columns will be inherently sensitive to class skews." I take this to mean that if I have imbalanced data I should not use those metrics. Could you please confirm my understanding?
Answers
-
Hello @ozgeozyazar
Actually, it is not that you shouldn't use these measures, but they vary when there is class imbalance and can be misleading; accuracy is the classic example.
-
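A minimal sketch (hypothetical numbers, plain Python, nothing RapidMiner-specific) of why accuracy misleads on imbalanced data: a model that always predicts the majority class looks excellent on accuracy while finding no positives at all.

```python
# Hypothetical imbalanced dataset: 95 healthy cases, 5 diseased cases.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # trivial model: always predicts "healthy"

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive (diseased) class: fraction of positives found.
recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 5

print(accuracy)  # 0.95 -- looks excellent
print(recall)    # 0.0  -- yet not a single diseased case is found
```

This is exactly the kind of skew sensitivity the quoted article warns about.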
@varunm1 I would highly recommend having a look at AUPRC.
-
Hi @ozgeozyazar,
To complete @mschmitz's post, here is a Kaggle article which advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric when the dataset is very imbalanced:
https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve
If you want to use the AUPRC performance operator in your RapidMiner process, you have to install the free Operator Toolbox extension.
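For intuition about what that operator reports, here is a plain-Python sketch of one common way to compute AUPRC, the average-precision formulation (this is an illustration of the metric, not the Operator Toolbox implementation; the data below is made up):

```python
def average_precision(y_true, scores):
    """AUPRC via average precision: the mean of the precision values
    at each rank where a true positive is encountered."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank   # precision at this recall point
    return total / n_pos

# Hypothetical model scores: two positives ranked high, one ranked last.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
print(round(average_precision(y, s), 3))  # 0.767
```

Unlike ROC-AUC, this number is pulled down hard by that one poorly ranked positive, which is why AUPRC is more informative when positives are rare.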
Regards,
Lionel
-
Thanks, @lionelderkrikor, for sharing this.
-
Hi @varunm1 and @lionelderkrikor, now that I am on my working PC: have a look at this paper: https://www.biostat.wisc.edu/~page/rocpr.pdf . I discovered it while working with Sven. It proves that a curve which dominates in AUPRC also dominates in AUC, but not the other way around. Besides the usual problems I talk about with correlation to business value, I would thus prefer AUPRC if I know the class balance.
Best,
Martin
-
Interesting reading. Just remember that which metric is "better" (AUC vs. AUPRC) is very much a function of business needs, since they optimize different things. As @mschmitz has noted in the past, if you can actually assign a cost to your different classification outcomes (TP, TN, FP, FN), then the best approach is to use the Performance Costs operator and optimize directly for that. These other curves are simply approaches based on other useful metrics. The other thing to be aware of is that AUPRC is probably less well known, so you might have some difficulty explaining it even to other data scientists, never mind business users.
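To make the cost-based idea concrete, here is a plain-Python sketch of what a cost-sensitive performance measure computes (the cost values below are invented for illustration; in practice they would come from the business problem, e.g. a missed disease case costing far more than an unnecessary follow-up test):

```python
# Hypothetical cost matrix: correct predictions are free, a false
# positive costs 1 unit, a missed positive (false negative) costs 10.
COSTS = {"TP": 0.0, "TN": 0.0, "FP": 1.0, "FN": 10.0}

def total_cost(y_true, y_pred, costs=COSTS):
    """Sum the misclassification costs over all predictions."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            cost += costs["TP"]
        elif t == 0 and p == 0:
            cost += costs["TN"]
        elif t == 0 and p == 1:
            cost += costs["FP"]
        else:
            cost += costs["FN"]
    return cost

# On a 95/5 imbalanced set, the "always negative" model that wins on
# accuracy is no longer automatically the winner once costs are assigned:
y_true = [0] * 95 + [1] * 5
print(total_cost(y_true, [0] * 100))  # 50.0: five missed positives
print(total_cost(y_true, [1] * 100))  # 95.0: 95 false alarms
```

With this framing, model selection becomes "minimize expected cost" rather than "maximize a curve area", which is exactly what a cost-sensitive performance operator lets you optimize directly.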