Newbie - expected performance output after using the Sample operator
AmsDani
New Altair Community Member
Hi, sorry for the beginner's question... I have a data set with 30,000 rows. The target variable is imbalanced: 24,000 false / 6,000 true. So I used the Sample operator to balance it (1,000 of each). At the end, the Performance (Classification) operator gives a confusion matrix with only 2,000 results (from the sample). I was expecting the evaluation (totals per TP/TN/FP/FN) to be based on the entire dataset (30,000 rows in total) so that I could evaluate costs as well (with the Performance (Costs) operator). What have I missed? Maybe the issue is in how I wired the input/output ports? Any tips on where it can go wrong? I have tried many ways... Thanks in advance for your help!
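For illustration only, here is a rough scikit-learn sketch of the situation (synthetic data, made-up attribute count, not the actual RapidMiner process); it just shows why only the 2,000 sampled examples can ever appear in the confusion matrix when the sampling happens before validation:

```python
# Rough analogue of the process above on synthetic data (24,000 false / 6,000 true).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(30_000, 5))            # 30,000 examples, 5 made-up attributes
y = np.array([0] * 24_000 + [1] * 6_000)    # imbalanced label

# "Sample" step: keep 1,000 examples of each class BEFORE any validation.
idx = np.concatenate([
    rng.choice(np.where(y == 0)[0], 1_000, replace=False),
    rng.choice(np.where(y == 1)[0], 1_000, replace=False),
])
X_bal, y_bal = X[idx], y[idx]

# Cross-validate on the balanced sample only: the confusion matrix can never
# contain more than the 2,000 sampled examples.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X_bal, y_bal, cv=10)
print(confusion_matrix(y_bal, pred).sum())  # 2000, not 30000
```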
Best Answers
-
As you selected only 2,000 examples for model building and validation, this is what you get in the confusion matrix. However, since you use cost as a method of model evaluation, you can also use a cost-sensitive model to deal with class imbalance, e.g. a decision tree. I assume the cost of misclassifying the minority class is high (e.g. the positive case, when it represents fraud) and the cost of misclassifying the majority class is low (the negative case). When the cost structure is set up in this way, the majority class can be down-weighted in favour of the minority class during model training, thus overcoming the problem of class imbalance.
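For what it's worth, here is a minimal scikit-learn sketch of that cost-sensitive idea (class weights on a decision tree). The 1:4 cost ratio and the synthetic data are placeholders, and RapidMiner's own Decision Tree and Performance (Costs) operators are of course configured differently in the GUI:

```python
# Sketch of a cost-sensitive decision tree via class weights; the 1:4 ratio
# and the synthetic data are assumptions, not values from the question.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(30_000, 5))
y = np.array([0] * 24_000 + [1] * 6_000)    # 0 = majority/false, 1 = minority/true

# Misclassifying the minority class is treated as 4x as costly, so training is
# not dominated by the 24,000 majority examples and no downsampling is needed.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, random_state=0)

# Because no examples are thrown away, the confusion matrix covers all 30,000 rows.
pred = cross_val_predict(tree, X, y, cv=10)
print(confusion_matrix(y, pred).sum())      # 30000
```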
-
Another way to solve this is moving the sampling *into* the training phase of the cross-validation. That way, you're building balanced models, but still validating on all data.
Also, sampling before the validation creates additional "knowledge" for the modeling process that you won't have later when applying the model.
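Not RapidMiner, but the same idea expressed with scikit-learn and imbalanced-learn (synthetic data, an assumed undersampling step): the sampler sits inside the cross-validation pipeline, so each model is trained on a balanced subset while every example is still used for validation.

```python
# Sketch only: the sampler is part of the pipeline, so it is applied to the
# training folds only, while predictions (and the confusion matrix) still
# cover the full, imbalanced dataset.
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(30_000, 5))
y = np.array([0] * 24_000 + [1] * 6_000)

balanced_model = Pipeline([
    ("balance", RandomUnderSampler(random_state=0)),   # balances each training fold
    ("tree", DecisionTreeClassifier(random_state=0)),
])

pred = cross_val_predict(balanced_model, X, y, cv=10)
print(confusion_matrix(y, pred).sum())  # 30000 -> validated on all data
```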
Regards,
Balázs
-
Thanks for your answers! I will try it the way you proposed, Balázs!