Performance Measures for Imbalanced Data

ozgeozyazar
ozgeozyazar New Altair Community Member
edited November 2024 in Community Q&A

Hi All !

My question is not directly about the program, but I know that this community has many valuable data miners, so I believe I can find the correct answer here. I am doing decision tree classification and measuring both classification and binomial performance using different parameter combinations. I need to select one of the well-performing models to build a decision tree for detecting disease risk factors. I have read an article that says "Any performance metric that uses values from both columns will be inherently sensitive to class skews". I took this to mean that if I have imbalanced data, I should not use those metrics. Could you please confirm my understanding?


Answers

  • varunm1
    varunm1 New Altair Community Member
    edited May 2019
    Hello @ozgeozyazar

    Actually, it is not that you shouldn't use them, but these measures vary when there is a class imbalance and can be misleading; accuracy is one example.
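
To make the class-skew point concrete, here is a minimal Python sketch. The counts are made up for illustration and do not come from any real dataset. The classifier's behaviour within each class is held fixed and only the number of negatives is multiplied by 10. Metrics computed within a single true class (TPR, FPR) stay the same, while metrics that mix both classes (precision, accuracy) move.

```python
def metrics(tp, fn, fp, tn):
    """Return common metrics from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),        # uses positives only
        "fpr": fp / (fp + tn),        # uses negatives only
        "precision": tp / (tp + fp),  # mixes both classes
        "accuracy": (tp + tn) / (tp + fn + fp + tn),  # mixes both classes
    }

# Hypothetical counts: 100 positives, 100 negatives.
balanced = metrics(tp=80, fn=20, fp=10, tn=90)
# Same per-class behaviour, but 10x as many negatives.
skewed = metrics(tp=80, fn=20, fp=100, tn=900)

for name in balanced:
    print(f"{name:9s} balanced={balanced[name]:.3f} skewed={skewed[name]:.3f}")
```

In the skewed case a model that always predicts the negative class would reach 1000/1100 (about 0.909) accuracy with a TPR of 0, which is higher than the accuracy of the model above. That is the sense in which accuracy alone can mislead.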
  • MartinLiebig
    MartinLiebig
    Altair Employee
    @varunm1 I would highly recommend having a look at AUPRC.
  • varunm1
    varunm1 New Altair Community Member
    edited May 2019
    Thanks, @mschmitz. I was not aware that we have AUPRC. Generally, I take a trade-off between AUC and kappa, but this is good to know.
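
For readers unfamiliar with kappa: Cohen's kappa rescales the observed accuracy against the agreement expected by chance from the class frequencies. The sketch below is a plain Python illustration with hypothetical counts, not RapidMiner's implementation.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: product of actual and predicted class frequencies.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present and predicted
    return (observed - expected) / (1 - expected)

# Hypothetical data: 50 positives, 950 negatives.
always_negative = cohen_kappa(tp=0, fn=50, fp=0, tn=950)  # accuracy 0.95
some_model = cohen_kappa(tp=40, fn=10, fp=30, tn=920)     # accuracy 0.96

print(f"always negative: kappa={always_negative:.3f}")
print(f"some model:      kappa={some_model:.3f}")
```

The two classifiers differ by only one point of accuracy, yet kappa separates them clearly, because the always-negative model does no better than chance.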
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @ozgeozyazar,

    To add to @mschmitz's post, here is a Kaggle article that advises favoring AUPRC (Area Under the Precision-Recall Curve) as the performance metric of a model when the dataset is very imbalanced:

    https://www.kaggle.com/lct14558/imbalanced-data-why-you-should-not-use-roc-curve

    If you want to use the AUPRC (performance) operator in your process in RapidMiner, you have to install the free Operator Toolbox extension.

    Regards,

    Lionel
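
As a rough illustration of what AUPRC measures, the sketch below computes average precision (one common step-wise estimate of the area under the precision-recall curve) next to ROC AUC in plain Python. The labels and scores are invented, tied scores are assumed not to occur in the average-precision function, and the result may differ from what the Operator Toolbox operator computes.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical imbalanced ranking: 100 examples, 5 positives
# sitting at ranks 3, 6, 9, 12 and 15 when sorted by score.
positive_ranks = {3, 6, 9, 12, 15}
labels = [1 if rank in positive_ranks else 0 for rank in range(1, 101)]
scores = [(101 - rank) / 100 for rank in range(1, 101)]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC={auc:.3f}  average precision={ap:.3f}")
```

On this toy ranking the ROC AUC is about 0.94 while the average precision is about 0.33: two out of every three examples flagged near the top are false alarms, which the ROC view hides because the 95 negatives dominate the false positive rate.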
  • varunm1
    varunm1 New Altair Community Member
    Thanks, @lionelderkrikor for sharing this.
  • MartinLiebig
    MartinLiebig
    Altair Employee

    Now that I am on my work PC: have a look at this paper: https://www.biostat.wisc.edu/~page/rocpr.pdf . I discovered it while working with Sven. It proves that a curve dominates in ROC space if and only if it dominates in precision-recall space, but that an algorithm which optimizes AUC is not guaranteed to optimize AUPRC. Besides the usual problems I talk about regarding correlation to business value, I would thus prefer AUPRC if I know the class balance.

    Best,
    Martin
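
A small made-up example shows that the two areas can disagree about which model is better. Each list below is a hypothetical ranking of 2 positives and 8 negatives, ordered from highest to lowest score; neither the rankings nor the helper functions come from the paper.

```python
def auc_from_ranking(ranked):
    """ROC AUC for 0/1 labels ordered from highest to lowest score (no ties)."""
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    neg_seen, pairs_lost = 0, 0
    for y in ranked:
        if y == 1:
            pairs_lost += neg_seen  # negatives ranked above this positive
        else:
            neg_seen += 1
    return 1 - pairs_lost / (n_pos * n_neg)

def ap_from_ranking(ranked):
    """Average precision for 0/1 labels ordered from highest to lowest score."""
    hits, total = 0, 0.0
    for rank, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

model_a = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # one positive on top, one far down
model_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # both positives fairly high

for name, ranking in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: AUC={auc_from_ranking(ranking):.3f} "
          f"AP={ap_from_ranking(ranking):.3f}")
```

Model B has the higher AUC (0.750 against 0.625), while model A has the higher average precision (0.625 against about 0.417). Neither curve dominates the other here, so which model counts as better depends on which area you look at.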

  • Telcontar120
    Telcontar120 New Altair Community Member
    Interesting readings.  Just remember that which metric is "better" (AUC vs AUPRC) is very much a function of business needs since they are optimizing different things.  As @mschmitz has noted in the past, if you can actually assign a cost to your different classification outcomes (TP, TN, FP, FN) then the best approach is to use the Performance Costs operator and optimize directly for that.  These other curves are simply approaches based on other useful metrics. The other thing to be aware of is that AUPRC is probably less well known, so you might have some difficulties in explaining it even to other data scientists, never mind business users.
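
The cost-based idea can also be sketched outside RapidMiner. The Python below is an illustration only, not the Performance Costs operator: the labels, scores, and cost values are hypothetical, and it assumes correct predictions cost nothing.

```python
def error_counts(labels, scores, threshold):
    """Count false positives and false negatives at a score threshold."""
    fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    return fp, fn

def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost; TP and TN are assumed to cost 0."""
    fp, fn = error_counts(labels, scores, threshold)
    return fp * cost_fp + fn * cost_fn

# Hypothetical scored examples (1 = disease, 0 = healthy).
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

# Assumed costs: missing a case is 10x worse than a false alarm.
COST_FP, COST_FN = 1, 10

candidates = sorted(set(scores))
best = min(candidates,
           key=lambda t: total_cost(labels, scores, t, COST_FP, COST_FN))
best_cost = total_cost(labels, scores, best, COST_FP, COST_FN)
print(f"best threshold={best}, total cost={best_cost}")
```

With these costs the cheapest threshold is the lowest one that still catches every positive. With equal costs for both error types, a higher threshold would win instead, which is the point about optimizing for the business need rather than for a generic curve.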