Newbie question: XValidation
nicugeorgian
New Altair Community Member
Hi,
For a cross-validation process with, e.g., XValidation, the example set S is split up into, say, 3 subsets: S1, S2, and S3.
The inner operator of XValidation is then applied
once with S1 as test set and S2 ∪ S3 as training set,
once with S2 as test set and S1 ∪ S3 as training set, and
once with S3 as test set and S1 ∪ S2 as training set.
For each of these runs, a model is returned.
My question is: how do I decide, in general, which model is the best? Or is there no best model ...
Thanks,
Geo
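For reference, the splitting scheme described above looks roughly like this in Python, using scikit-learn's KFold as a stand-in for XValidation (the data set and learner are made up for illustration):
[code]
# Rough sketch of the 3-fold scheme described above; scikit-learn's KFold is
# used as a stand-in for XValidation, and the data set and learner are invented.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=90, random_state=0)   # the example set S

kf = KFold(n_splits=3, shuffle=True, random_state=0)        # S1, S2, S3
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # train on the union of the other two subsets, test on the held-out one
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {i}: test accuracy = {acc:.2f}")
[/code]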
Answers
Hi Geo,
this is actually one of the questions we have been asked most often during the last years, and there seems to be a lot of misunderstanding about how to properly evaluate models with cross-validation techniques. The answer is as simple as this: none of the models created for the single folds is the best. The best one is the one trained on the complete data set, or on a well-chosen sample (it is not the task of a cross-validation to find such a sample).
If you ask which is the best one, I would ask back: "What should the best model be?" The one with the lowest error on the corresponding test set? Well, that would again be like overfitting, only now not on a training set but on a test set. So it is probably not a good idea to select a model based on the test error alone.
The best thing one can do is to think of cross-validation as a process which is completely independent of the learning process:
1) One process is the learning process, which is performed on the complete data.
2) But now you also want to know how well your model will perform when it is applied to completely unseen data. This is where the second process comes into the game: estimating the predictive power of your model. The best estimate you could get is calculated with leave-one-out (LOO), where all but one example are used for training and only the remaining one for testing. Since almost all examples are used for training, the resulting model is the most similar to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross-validation instead, in order to get a good estimate in less time.
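As a minimal sketch of these two separate processes, here is the same idea in Python with scikit-learn rather than a RapidMiner process (the data set and learner below are placeholders):
[code]
# Sketch of the two processes: cross-validation only *estimates* the
# predictive power; the model you actually use is trained on the complete data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Process 2: estimate how well this kind of model performs on unseen data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"estimated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Process 1: the model that is actually delivered is learned on the complete data.
final_model = DecisionTreeClassifier(random_state=0).fit(X, y)
[/code]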
Hope that makes things a bit clearer. Cheers,
Ingo
Ingo, many thanks for the very detailed answer!
I somehow anticipated your answer when I wrote "Or is there no best model ..."
Hi,
[quote author=nicugeorgian]
I somehow anticipated your answer when I wrote ...
[/quote]
I already thought so, but we get this question so often that I thought a longer answer might be a good idea, so that we can post a link here in the future.
Cheers,
Ingo
Creating a good model is a tricky business.
By using too much data, too little, or by comparing and optimising your model on different data sets (bootstrapping), you run the risk of overfitting your model.
Taking a set of cases 'C' and a model 'M', the Coefficient of Concordance (CoC) is an indication of how well a model can separate the cases into the defined categories. [M.G. Kendall (1948) Rank Correlation Methods, Griffin, London]
When the CoC of a model is 50%, you actually have a random model (below 50%, your model is "cross-wired"), so 50% is the lowest CoC you will get.
Accuracy is a measure that relates the number of correctly classified cases to the total number of cases; this is different from the CoC.
These two measures (CoC and accuracy) determine how good a model is.
For instance, when we sort the scored cases by the outcome predicted and the actual outcome:
....BBBBBBBBBBBBBBBBBB|GGGGGGGGGGGGGGGGGG.... Here we have 100% CoC
The accuracy is determined by the number of cases that are actually scored correctly.
....BBBBBBBBB|GBBGBGGBBBGGBGGGB|GGGGGGGGGG.... Here is a more realistic picture; naturally the CoC is below 100%
Now, by deciding strategically where to place your cut-off, you determine the accuracy.
If you place your cut-off higher, you take a lower risk and your accuracy will be higher.
Accepting more risk, and a lower accuracy, you place your cut-off lower, allowing yourself a bigger market share.
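As a rough sketch of the cut-off idea, here is a small Python example (the sorted labels are invented to mimic the picture above):
[code]
# Sort cases by model score and see how accuracy changes as the cut-off moves.
# The sorted labels are made up to resemble the "realistic" picture above.
labels = list("BBBBBBBBBGBBGBGGBBBGGBGGGBGGGGGGGGGG")  # B = bad, G = good, sorted by score

def accuracy_at_cutoff(sorted_labels, cutoff):
    # everything left of the cut-off is predicted B, everything right is predicted G
    predicted = ["B"] * cutoff + ["G"] * (len(sorted_labels) - cutoff)
    hits = sum(p == a for p, a in zip(predicted, sorted_labels))
    return hits / len(sorted_labels)

for cutoff in (9, 18, 27):
    print(f"cut-off after case {cutoff}: accuracy = {accuracy_at_cutoff(labels, cutoff):.2f}")
[/code]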
Hello
...reviving an old discussion...
My question (since I am currently checking the possibilities for validating a ranking classifier without applying a cut-off / threshold) is: why should anyone bother to use the CoC? It is much easier to calculate the sum of the ranks of the true positives (TP). This value can easily be transformed to the [0,1] interval (e.g. 1 = optimal ranking, 0 = worst ranking).
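A minimal sketch of that rank-sum idea in Python (ignoring ties; the toy scores and labels are invented):
[code]
# Rank all cases by the model's score, sum the ranks of the true positives and
# rescale to [0, 1] (1 = all positives ranked above all negatives, 0 = the
# reverse). Ties are ignored here for simplicity.
def normalized_tp_rank_sum(scores, labels):
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])        # rank 1 = lowest score
    ranks = {idx: r for r, idx in enumerate(order, start=1)}
    n_pos = sum(labels)
    rank_sum = sum(ranks[i] for i in range(n) if labels[i] == 1)
    r_min = n_pos * (n_pos + 1) / 2                          # positives ranked lowest
    r_max = n_pos * n - n_pos * (n_pos - 1) / 2              # positives ranked highest
    return (rank_sum - r_min) / (r_max - r_min)

# toy example: a higher score should mean "positive"
print(normalized_tp_rank_sum([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0 (optimal)
print(normalized_tp_rank_sum([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]))  # 0.0 (worst)
[/code]
(Up to ties, this normalized rank sum is the same as the area under the ROC curve.)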
I know that the CoC is the value of the test statistic for Kendall's CoC test, so a statistical test can be applied. But this test only indicates whether there is any difference (in agreement) at all, just like an ANOVA. I am looking for a multiple-comparison test that tells me WHERE the difference occurs (e.g. the Tukey test). The only test I found for this case is the Friedman rank sum test.
another one:
[quote author=mierswa]
But now you also want to know how well your model will perform when it is applied to completely unseen data. This is where the second process comes into the game: estimating the predictive power of your model. The best estimate you could get is calculated with leave-one-out (LOO), where all but one example are used for training and only the remaining one for testing. Since almost all examples are used for training, the resulting model is the most similar to the model trained on the complete data. Since LOO is rather slow on large data sets, we often use a k-fold cross-validation instead, in order to get a good estimate in less time.
[/quote]
Hm, hm. Recently I read a very interesting PhD thesis by Ron Kohavi (Click), who has shown that LOO reduces the variance (i.e. it is stable) but increases the bias. Imagine a binary classification problem where 50% of all instances have label=1 and 50% have label=0. Now apply a majority classifier: using LOO, the estimated accuracy will be zero.
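That degenerate case is easy to reproduce, e.g. with scikit-learn's DummyClassifier standing in for the majority classifier (a sketch, not a RapidMiner process):
[code]
# A majority classifier evaluated with leave-one-out on a perfectly balanced
# binary data set scores 0 accuracy, because removing the held-out example
# always makes its class the minority class in the training fold.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.zeros((20, 1))                    # features are irrelevant here
y = np.array([0] * 10 + [1] * 10)        # 50% label=0, 50% label=1

scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y,
                         cv=LeaveOneOut())
print(scores.mean())                      # -> 0.0
[/code]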
However, Kohavi concludes that it is best to apply 6- to 10-fold cross-validation and to repeat it 10-20 times to reduce the variance. Note that repeating the CV increases the alpha error if you plan to use statistical tests to validate the results.
... so we return to the suggestion that 10-fold CV is the best procedure you can use. I just wanted to underline the argument ...
greetings,
Steffen
Hi Steffen,
yes, I know Ron's thesis, and this is actually a good point. So my explanation of LOO might be a bit misleading. Anyway, I just wanted to give the readers a feeling for how error estimation with any cross-validation-like process and a model learned from the complete data are connected. The reason for this explanation was quite simple: it's probably one of the most frequently asked questions - or at least it used to be some time ago.
Cheers,
Ingo