New guy ... help interpreting data
I need some help interpreting the output of Auto Model. I have a true/false label with a 600/11,000 split across approximately 12,000 examples. At first glance, the random forest is the most accurate, but the AUC is much higher for the gradient boosted trees, and the precision points at the decision tree and random forest. I am not an expert in statistics, and I would much appreciate it if someone could break this down for me and tell me whether any of the predictions are statistically meaningful, and how I would go about determining that.
Thank you!
| Model | Accuracy (%) | Classification Error (%) | AUC | Precision (%) | Recall (%) | F-Measure (%) | Sensitivity (%) | Specificity (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes | 93.1 | 6.9 | 0.859 | 26.7 | 23.0 | 24.7 | 23.0 | 96.7 |
| Generalized Linear Model | 94.8 | 5.2 | 0.855 | 40.0 | 13.1 | 19.8 | 13.1 | 99.0 |
| Logistic Regression | 94.7 | 5.3 | 0.848 | 37.2 | 13.1 | 19.4 | 13.1 | 98.9 |
| Deep Learning | 93.5 | 6.5 | 0.867 | 31.9 | 29.5 | 30.6 | 29.5 | 96.7 |
| Decision Tree | 95.2 | 4.8 | 0.500 | 100.0 | 1.6 | 3.2 | 1.6 | 100.0 |
| Random Forest | 95.3 | 4.7 | 0.739 | 100.0 | 3.3 | 6.3 | 3.3 | 100.0 |
| Gradient Boosted Trees | 94.6 | 5.4 | 0.915 | 40.6 | 21.3 | 28.0 | 21.3 | 98.4 |
Answers
-
You can't focus on accuracy here because your dataset is so imbalanced: it is easy to achieve high accuracy simply by predicting the majority class. In fact, you should probably consider either weighting or sampling to address the class imbalance, because it is almost certainly influencing your models.
AUC is a much better measure of model performance when you have an imbalanced class distribution, so by that measure the GBT is indeed the best-performing model. It is noteworthy that the very simple Naive Bayes is also performing quite well here, and that might be a good starting place or baseline model.
The question of statistical significance is one that is laden with theoretical baggage. The short answer to your question is that all of these models other than the decision tree are giving you some kind of discriminatory tool to use regardless of your theoretical perspective on p-value interpretations (frequentist or Bayesian). Modern machine learning does not heavily emphasize the calculation or role of p-values, unlike the classic statistical approach; instead, it relies on cross-validation performance (you did use cross-validation, didn't you?) to understand model usefulness.
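If you want to see this outside of Auto Model, here is a minimal sketch in Python/scikit-learn (an assumption on my part; the synthetic dataset, the logistic regression, and all parameters are purely illustrative). It cross-validates the same model with and without class weighting on a roughly 5/95 split and reports both accuracy and AUC:

```python
# Minimal sketch (not Auto Model itself): accuracy vs. AUC on an imbalanced
# synthetic dataset that mimics the ~600/11,000 class split. All names and
# parameters here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=11600, n_features=50,
                           weights=[0.948, 0.052], random_state=42)

for label, model in [
    ("plain",    LogisticRegression(max_iter=1000)),
    ("weighted", LogisticRegression(max_iter=1000, class_weight="balanced")),
]:
    scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "roc_auc"])
    print(label,
          "accuracy=%.3f" % scores["test_accuracy"].mean(),
          "AUC=%.3f" % scores["test_roc_auc"].mean())
```

You should see that accuracy barely moves between the two variants, while AUC tells you much more about how well the two classes are actually being separated.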
-
I am not sure I follow. How is the dataset imbalanced? Let's assume that 600 people died and 11,000 survived train crashes. I have approximately 50 data points that describe the train and the people. I have 12,000 crashes and I am trying to predict the likelihood of a passenger dying based on those 50 data points. Are you saying that the AUC is best at describing the rate of survival? Why is that? I used Auto Model; the data is straight out of Auto Model.
-
Hi!
The AUC is not best at describing the rate of survival. It is a performance measure for comparing models. It gives you an idea about the quality of your models.
Imbalanced means that the two values of your label are not split 50:50 but more like 5:95. A simple model (maybe the decision tree) just tells you "Hey, everyone survives" and is 95 % right (because 95 % actually survive), but it's not a good model. You see this in the low AUC value.
A good model actually works on distinguishing died and survived and will have a higher AUC. (Take a look at the different AUC curves.)
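To make the "everyone survives" point concrete, here is a tiny sketch (scikit-learn again, with made-up data) of a majority-class baseline: it reaches roughly 95 % accuracy but an AUC of 0.5, just like the decision tree in your table:

```python
# A minimal sketch of the "everyone survives" baseline: high accuracy, AUC of 0.5.
# The data here is synthetic and used purely for illustration.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y = rng.random(11600) < 600 / 11600          # True = died (~5 %), False = survived
X = rng.normal(size=(11600, 5))              # features are irrelevant to this model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("accuracy:", accuracy_score(y, baseline.predict(X)))             # ~0.95
print("AUC:     ", roc_auc_score(y, baseline.predict_proba(X)[:, 1]))  # 0.5
```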
You might get an even better model by, e.g., downsampling the survived class. The most correct way is to do this inside the cross-validation's training phase.
It depends on what your goal is. If death is the most important class, and you'd like to have a higher recall of death cases from your model, you could weight these even higher. Then you'll get a model that makes more mistakes on survivors (predicting them as dead) but will catch most of the deaths. If your use case is analyzing possible reasons for deaths and avoiding them, this might be your way to achieve it.
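If it helps, here is a rough sketch of what "downsampling inside the cross-validation's training phase" looks like in code. I'm assuming Python with the imbalanced-learn package here rather than Auto Model, and the dataset and parameters are made up; the point is that the sampler in the pipeline is applied only when each training fold is fitted, so every test fold keeps the original class distribution:

```python
# Rough sketch: resampling only inside the cross-validation training phase,
# using an imbalanced-learn Pipeline (an assumption; Auto Model does this
# internally in its own way). The sampler runs on training folds only.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=11600, n_features=50,
                           weights=[0.948, 0.052], random_state=42)

model = Pipeline([
    ("downsample", RandomUnderSampler(random_state=42)),  # training folds only
    ("gbt", GradientBoostingClassifier()),
])
print("AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```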
Regards,
Balázs
-
As Balázs says, the rate of survival is not the key issue here, because most people survive. You want to build a model that can help you find out which key factors are associated with death vs. survival, or a way to separate the two classes. For that, AUC is probably the best performance measure for your model.
-
Hey guys,
I am dealing with the same problem at the moment and would like to use SMOTE to upsample. Regarding this, I have 2 questions.
1. I don't have just one attribute that I want to predict as true or false, but 10 (different fault categories that should be predicted for each event). Some of them are about 2:3 balanced, some are 5:95. Is it correct to upsample for every attribute separately? My dataset would then be around 7 times bigger than before. Or is there another way?
2. @BalazsBarany you said that you would do the up/downsampling inside the cross-validation's training phase. I thought that you would do it as part of the feature generation process and do the modelling afterwards.
If you need more information or this does not fit here, I will make a different post and maybe add some information.
Thank you for your help.
Regards,
Tobias
-
If you have 10 different attributes to predict, you are going to need to sample 10 different times and set each attribute as a label to build 10 different models.
The reason why sampling is done inside cross-validation is to determine the impact that sampling (which involves some quasi-random processes) has on model performance, which can be significant. Ultimately you may want to do it on the entire dataset when building your final model, but for understanding performance, doing it inside the CV will give you the least biased performance.
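A hypothetical sketch of that workflow (Python with imbalanced-learn; `X`, the ten label columns, and all parameters are placeholders I made up): loop over the labels, and put SMOTE inside a pipeline so it is re-applied separately per label and only on the training folds of each cross-validation split:

```python
# Hypothetical sketch: one model per fault category, SMOTE applied separately
# for each label and only inside the cross-validation training folds.
# X and Y are placeholder data, not the poster's real table.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                  # placeholder feature table
Y = rng.random((2000, 10)) < 0.05                # placeholder: 10 fault labels

for i in range(Y.shape[1]):
    model = Pipeline([
        ("smote", SMOTE(random_state=0)),        # upsampling for this label only
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])
    auc = cross_val_score(model, X, Y[:, i], cv=5, scoring="roc_auc").mean()
    print(f"fault category {i}: AUC = {auc:.3f}")
```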
-
Also, sampling only in the cross-validation's training phase means you still validate on the whole data set. If you downsample before the validation, you lose valuable data for validation.
-
As your dataset is imbalanced, you can use kappa (an inter-rater agreement measure) or root mean squared error for a better understanding of the performance. 5-fold cross-validation is recommended in this case. I suggest you be careful with downsampling, as in the real world you still need to deal with this sort of data.
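For reference, a minimal sketch of scoring with kappa under 5-fold cross-validation (scikit-learn with synthetic data; the model and parameters are assumptions): kappa corrects for the agreement you would get by chance, which is why it is more informative than plain accuracy on an imbalanced label.

```python
# Minimal sketch (assumed setup): Cohen's kappa under 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=11600, n_features=50,
                           weights=[0.948, 0.052], random_state=42)

kappa = make_scorer(cohen_kappa_score)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring=kappa)
print("mean kappa:", scores.mean())
```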