"Problem interpreting RIPPER (JRIP) classification accuracy"
wj
New Altair Community Member
Hi,
I'm new to this forum and also quite new to RapidMiner, and I have a question whose answer I haven't found in the manuals or forums. I apologize if this question is too trivial for this forum, but the issue is quite important to me.
I use the RIPPER (W-JRip) algorithm in RapidMiner 4.6 to obtain a classification accuracy and rulesets for a dataset, but I'm not sure how to interpret the reported classification accuracy versus the accuracy of the JRip ruleset.
If I look at the "Performance vector" tab, which contains the confusion matrix and the accuracy, I suppose the accuracy value is the mean accuracy obtained in the cross-validation process? And the sensitivity and specificity can be calculated from the confusion matrix, which shows mean values of true/false positives and negatives obtained by the validation process, is this correct? What confuses me is the "W-JRip" tab, which contains the ruleset that can be used to classify the subjects into groups A and B. Is this some kind of optimal ruleset that had the best classification accuracy in some iteration of the validation process? If I apply the ruleset to the dataset, I always get better classification accuracy/sensitivity/specificity than the values in the "Performance vector" tab. What worries me is that the accuracy given by the JRip ruleset sometimes differs by as much as 20 percentage points from the accuracy displayed in the "Performance vector" tab. Can someone explain how the software obtains this ruleset, and which of the two accuracies (the ruleset's or the one in the "Performance vector" tab) is more reliable and should be used? Thank you for your help!
For information, my dataset (about 100 subjects) has two groups, A and B, and approximately 10 variables.
Answers
Hi,
you are correct with your assumptions about the confusion matrix: it counts the classification outcome of each single example.
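As a rough illustration of how sensitivity and specificity follow from such a matrix, here is a minimal Python sketch. It assumes the matrix simply accumulates per-example outcomes over all validation folds; the function names and the example (actual, predicted) pairs are hypothetical and are not taken from RapidMiner.

```python
from collections import Counter

def pooled_confusion(fold_results, positive="A"):
    """Sum per-example outcomes over all folds into one confusion matrix."""
    counts = Counter()
    for fold in fold_results:
        for actual, predicted in fold:
            if actual == positive:
                counts["TP" if predicted == positive else "FN"] += 1
            else:
                counts["FP" if predicted == positive else "TN"] += 1
    return counts

def metrics(c):
    """Derive accuracy, sensitivity and specificity from the pooled counts."""
    total = c["TP"] + c["FN"] + c["FP"] + c["TN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total,
        "sensitivity": c["TP"] / (c["TP"] + c["FN"]),
        "specificity": c["TN"] / (c["TN"] + c["FP"]),
    }

# Hypothetical (actual, predicted) pairs from two validation folds
folds = [
    [("A", "A"), ("A", "B"), ("B", "B"), ("B", "B")],
    [("A", "A"), ("B", "A"), ("B", "B"), ("A", "A")],
]
cm = pooled_confusion(folds)
result = metrics(cm)
print(dict(cm), result)
```

Here group A is treated as the positive class; swapping the `positive` argument swaps the roles of sensitivity and specificity.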
In RapidMiner 4.6 there was a parameter "create_complete_model" that trains a model on the complete training data. That model is not itself evaluated, but models usually perform better when trained on more data.
If you apply it to your data set, the performance will be much better, since you are applying the model to the data it was trained on! In extreme cases, the model might simply have memorized all the training data and could hence reach 100% accuracy.
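The gap between the two numbers can be reproduced with a small Python sketch. This is not JRip or RapidMiner code: as a stand-in it uses a 1-nearest-neighbour classifier, which memorizes its training data, on made-up data with about 100 subjects, 10 variables and purely random labels. All names here are hypothetical.

```python
import random

def nearest_label(train, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    closest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return closest[1]

def accuracy(train, test):
    """Fraction of test examples whose label the model predicts correctly."""
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)

def cross_val_accuracy(data, k=10):
    """k-fold cross-validation, pooling the per-example outcomes of all folds."""
    folds = [data[i::k] for i in range(k)]
    hits = 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        hits += sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(data)

rng = random.Random(0)
# 100 subjects, 10 variables, labels assigned at random (no real signal)
data = [([rng.random() for _ in range(10)], rng.choice("AB")) for _ in range(100)]

resub = accuracy(data, data)    # model applied to its own training data
cv = cross_val_accuracy(data)   # each subject predicted by a model that never saw it
print(f"resubstitution: {resub:.2f}, cross-validation: {cv:.2f}")
```

Because the labels carry no signal, the cross-validated figure should hover around chance level, while the model applied to its own training data is always right. Only the cross-validated figure says anything about unseen subjects.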
I would suggest moving to RapidMiner 5.0, since it is especially helpful for beginners and makes the learning curve much gentler.
Greetings,
Sebastian
Thank you Sebastian!
This is how I figured it might be. Although the optimal ruleset found is not tested by cross-validation, isn't there some kind of inherent validation in the RIPPER algorithm? At least as I understood from the paper in which the algorithm is described (Cohen 1995), the dataset is split into two sets: set A, which is used to train an overfitted ruleset, and set B, which is used to prune the ruleset to minimize the classification error on unseen data. The rules I have obtained this way seem quite logical and simple (which means that the risk of overfitting is fairly low). I have noticed, though, that depending on the data, the accuracy of these optimal rules is usually a little lower when they are applied to a completely different dataset. This is probably due to the heavy variation in the data I'm studying. It would be extremely desirable to find optimal, fixed rulesets that generalize well in the environment and problems I work with (medicine).
Hi,
you are correct, the RIPPER algorithm does perform some testing itself, which prevents it from overfitting completely, but that is not a valid performance estimation, because the data used for pruning take part in building the ruleset.
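A small Python sketch can show why data used to choose rules cannot also score them. It is not an implementation of RIPPER: as a simplified stand-in it picks the best of 200 random one-condition rules on a small "pruning" set, using made-up data with random labels. The rule format, set sizes and names are all assumptions for illustration.

```python
import random

rng = random.Random(1)

def make_examples(n):
    # 10 random variables and a coin-flip label: there is nothing to learn
    return [([rng.random() for _ in range(10)], rng.choice("AB")) for _ in range(n)]

def make_rule():
    # Hypothetical one-condition rule: "if x[f] > t then hi, else the other group"
    f, t, hi = rng.randrange(10), rng.random(), rng.choice("AB")
    lo = "B" if hi == "A" else "A"
    return lambda x: hi if x[f] > t else lo

def accuracy(rule, examples):
    return sum(rule(x) == y for x, y in examples) / len(examples)

prune_set = make_examples(30)     # data used to choose among candidate rules
fresh_set = make_examples(5000)   # data that played no part in the choice
candidates = [make_rule() for _ in range(200)]

# Selecting on the pruning data makes its accuracy optimistic for the winner
best = max(candidates, key=lambda r: accuracy(r, prune_set))
prune_acc = accuracy(best, prune_set)
fresh_acc = accuracy(best, fresh_set)
print(f"accuracy on pruning data: {prune_acc:.2f}, on fresh data: {fresh_acc:.2f}")
```

The selected rule looks good on the data that selected it, yet it stays near chance on data it has never influenced. The same reasoning is why an outer cross-validation is still needed around a learner that prunes internally.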
Greetings,
Sebastian