How to resolve 100% accuracy in RapidMiner? [Urgent]

Question

Hello everyone,

The aim is to detect and predict fraud cases with optimal accuracy based on the dataset provided. For example, cases that are flagged as fraudulent and turn out to be non-fraudulent are not as critical as cases that are predicted to be non-fraudulent and turn out to be fraudulent.

For this, I wanted to compare Logistic Regression, Neural Net and Decision Tree (the work is provided). Whenever I run the models, every accuracy is near 100%; surely this is not correct.

I am new to RapidMiner and data preprocessing. Could someone advise me on which direction I should be heading?

lionelderkrikor · Answer

@StudentNeedsHelp

Yes, without Auto-Model, you can use the Performance (Costs) operator: first quantify the cost of a FN and the cost of a FP, then let the operator calculate the final cost of misclassification.
Please take a look at the process in the attached file, which uses your data, to experiment and to understand.
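
As a rough illustration of the idea behind a cost-based performance measure (this is not the Performance (Costs) operator itself), here is a minimal Python sketch. The cost values and the confusion-matrix counts are hypothetical and chosen only for the example; `misclassification_cost` is an illustrative helper, not a RapidMiner function.

```python
def misclassification_cost(fn, fp, cost_fn, cost_fp):
    """Total cost = (#false negatives * cost per FN) + (#false positives * cost per FP)."""
    return fn * cost_fn + fp * cost_fp

# Hypothetical costs: a missed fraud is assumed to be 20 times as costly as a false alarm.
COST_FN = 20.0
COST_FP = 1.0

# Hypothetical confusion-matrix counts for two candidate models on the same test set.
model_a = {"fn": 8, "fp": 5}    # misses most frauds, few false alarms
model_b = {"fn": 2, "fp": 40}   # catches most frauds, more false alarms

cost_a = misclassification_cost(model_a["fn"], model_a["fp"], COST_FN, COST_FP)
cost_b = misclassification_cost(model_b["fn"], model_b["fp"], COST_FN, COST_FP)
print(cost_a, cost_b)  # 165.0 80.0
```

With these assumed costs, the second model is preferable even though it makes more errors in total, because its errors are mostly the cheaper kind.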

Hope this helps,

Regards,

Lionel

Fraud_detection_upsampling.rmp

StudentNeedsHelp · Answer

Hi @lionelderkrikor, thank you for the explanation, it makes a lot more sense now. Yes, the priority is to correctly predict the fraud and make sure fraud isn't marked as non-fraud. I have now used SMOTE Upsampling on the Logistic Regression with non-negative coefficients, and the accuracy has dropped to around 97-98%. Is there a way I can quantify both false negatives and false positives without using Auto-Model? The second model, the neural network, is still displaying imbalance, and I am confused as to how to find the rare class responsible.

thanks

lionelderkrikor · Answer

Hi @StudentNeedsHelp,

Your dataset is highly imbalanced (there are many more "non-fraudulent" than "fraudulent" cases), so the model has difficulty establishing the relationship between your features and the minority class of your label ("fraudulent").
In the end, the model classifies all (or nearly all) of your transactions as "non-fraudulent", which is why the accuracy is near 100%.
I think a better performance indicator in your case is the "class recall". Your priority is to correctly predict the fraudulent cases, isn't it?
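
To make the point about accuracy versus class recall concrete, here is a small Python sketch. The data is hypothetical (1,000 transactions, 10 of them fraudulent) and the "model" simply predicts the majority class for everything; the helper names are illustrative only.

```python
# Hypothetical data: 1,000 transactions, 10 of them fraudulent (1 = fraud, 0 = non-fraud).
actual = [1] * 10 + [0] * 990

# A "model" that has learned nothing and labels every transaction as non-fraud.
predicted = [0] * len(actual)

def accuracy(actual, predicted):
    """Share of examples whose predicted label equals the actual label."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

def class_recall(actual, predicted, positive=1):
    """Share of actual `positive` examples that were predicted as `positive`."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy:     {accuracy(actual, predicted):.3f}")      # 0.990
print(f"fraud recall: {class_recall(actual, predicted):.3f}")  # 0.000
```

The accuracy looks excellent, yet not a single fraud case is caught, which is why the recall of the "fraudulent" class is the more informative number here.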
To achieve that, upsample your initial dataset by increasing the number of "fraudulent" examples, for instance with the SMOTE Upsampling operator. This way, you should increase the class recall of the fraudulent cases.
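
For intuition only, the following Python sketch shows the basic idea behind SMOTE-style upsampling: new minority examples are created by interpolating between an existing minority example and one of its nearest minority neighbours. It is a simplified illustration, not the implementation used by the SMOTE Upsampling operator; the function name, the parameters and the data points are all hypothetical.

```python
import random

def smote_like_upsample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    example and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority examples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between `base` and `neighbour`.
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (fraud) examples with two numeric features.
fraud = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_fraud = smote_like_upsample(fraud, n_new=6)
print(len(fraud) + len(new_fraud))  # 10 fraud examples after upsampling
```

Whatever upsampling method is used, it is generally applied to the training data only, so that performance is still measured on data with the original class distribution.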

Ideally, you can use Auto-Model after the upsampling operator and define the cost matrix at the "Prepare Target" screen (typically, you quantify the cost of misclassifying a "false negative" and the cost of misclassifying a "false positive").
Auto-Model will then be executed to minimize the misclassification cost and, ultimately, to maximize the gain.

Hope this helps,

Regards,

Lionel