Model selection for imbalanced training dataset

phivu
phivu New Altair Community Member
edited November 2024 in Community Q&A

Hi RapidMiner,

 

I'm doing model selection for an SVM using the "Optimize Parameters (Grid)" operator. My training dataset is imbalanced (782 positive examples and 2048 negative examples), so Accuracy (= (TP+TN)/(TP+TN+FP+FN)) is not a useful score for model selection: a predictor that labels everything as negative already reaches 2048/(2048+782) = 72.4%. Is there a way to use Precision and Recall, or a combination of them such as the F1 score, instead of Accuracy? I looked through the parameter list of the Performance operator but could not find those scores. Or is there another way to deal with an imbalanced dataset like this?
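
To make the arithmetic above concrete, here is a minimal Python sketch (not RapidMiner code) that scores the trivial "predict everything negative" model on the class counts from this post. The helper name `scores` and the convention of reporting an undefined ratio as 0.0 are assumptions made for illustration.

```python
# Class counts taken from the question above.
N_POS = 782
N_NEG = 2048


def scores(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: a ratio with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A predictor that labels every example negative: no positives are predicted,
# every negative is correct, and every positive is missed.
baseline = scores(tp=0, fp=0, tn=N_NEG, fn=N_POS)

for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

The baseline's accuracy comes out at roughly 0.724, while its precision, recall and F1 are all 0, which is why an F1-style score separates it from a model that actually finds positives.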

 

I attach my process file here. In this process, I use the "Optimize Parameters (Grid)" operator to find the SVM hyperparameters that give the best cross-validation performance. The process works very well on a balanced training dataset; now I wonder how to modify it for an imbalanced one. Thank you very much for your help!

Best Answers

  • IngoRM
    IngoRM New Altair Community Member
    Answer ✓

    Hi,

     

    Sure - all those measurements (precision, recall, F1 and many more) are available as parameters of the operator "Performance (Binominal Classification)".

     

    Hope this helps,

    Ingo

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    Another option is to add weights to balance the classes, since the SVM operator accepts weights. In either case you may also want to look at AUC as a performance metric. It's my preferred one for classification problems, since it does not depend on a single arbitrary cutoff threshold.
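
As a rough illustration of both suggestions, the Python sketch below (not RapidMiner code) computes class weights for the 782/2048 split and evaluates AUC by pairwise ranking. The weighting rule `n_total / (n_classes * n_class)` is one common heuristic and only an assumption here, not necessarily what any RapidMiner operator applies. The function names and the toy confidence scores are hypothetical.

```python
def balanced_weights(class_counts):
    """Weight each class by n_total / (n_classes * n_class) (assumed heuristic)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * count) for label, count in class_counts.items()}


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 if the positive example scores higher, 0.5 on a tie.
    No cutoff threshold is involved, only the ordering of the scores.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Class counts from the question; the rarer class gets the larger weight.
weights = balanced_weights({"positive": 782, "negative": 2048})
print(weights)

# Hypothetical confidence scores, for illustration only.
toy_pos = [0.9, 0.7, 0.4]
toy_neg = [0.8, 0.3, 0.2, 0.1]
print(auc(toy_pos, toy_neg))
```

With these weights each class contributes the same total weight (782 times the positive weight equals 2048 times the negative weight), so the SVM is no longer rewarded for ignoring the minority class. The pairwise loop is quadratic and meant only to show the definition; a rank-based computation is the usual choice on real data.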

Answers

  • phivu
    phivu New Altair Community Member

    Thank you Ingo,

    I've now found the scores in the "Performance (Binominal Classification)" operator!