Hello,
I've been noticing a phenomenon I can't quite explain, which seems related to what is described in this previous post. I'm training a linear regression model on a dataset (the Polynomial sample dataset, for instance, with 200 examples) using cross-validation with shuffled sampling: on the training side there is simply a Linear Regression with default parameters; on the testing side, an Apply Model and a performance evaluation. What I'm varying is the number of folds the cross-validation uses. Here are some values I observed with that dataset (a rough sketch of the setup in scikit-learn follows the list):
- 5-fold CV: correlation = 0.894 +/- 0.026 (micro average: 0.892)
- 10-fold CV: correlation = 0.902 +/- 0.038 (micro average: 0.891)
- 20-fold CV: correlation = 0.909 +/- 0.080 (micro average: 0.894)
- 50-fold CV: correlation = 0.899 +/- 0.174 (micro average: 0.894)
- 100-fold CV: correlation = 0.960 +/- 0.197 (micro average: 0.894)
- 150-fold CV: correlation = 0.300 +/- 0.460 (micro average: 0.894)
- 200-fold CV: correlation = 0.000 +/- 0.000 (micro average: 0.894)
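
In case it helps to reproduce this, here is roughly what I believe the equivalent experiment looks like in scikit-learn. This is only my approximation: the synthetic data is a stand-in for the Polynomial sample set, and the shuffling seed and Pearson correlation metric are assumptions on my part, so the numbers won't match mine, only the structure of the experiment.

```python
# Rough scikit-learn equivalent of the process described above. The data is a synthetic
# stand-in for RapidMiner's Polynomial sample set (200 examples): plain linear regression,
# shuffled k-fold CV, and the per-fold correlation between predictions and labels.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 3))                                # stand-in attributes
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=2.0, size=200)   # stand-in label

for n_folds in (5, 10, 20, 50, 100, 150, 200):
    fold_corrs = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        if len(test_idx) > 1:
            fold_corrs.append(pearsonr(y[test_idx], pred)[0])
        else:
            fold_corrs.append(np.nan)  # pearsonr needs at least 2 points per fold
    print(f"{n_folds:3d}-fold: correlation = {np.nanmean(fold_corrs):.3f} "
          f"+/- {np.nanstd(fold_corrs):.3f}")
```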
So, at first, increasing the number of folds means each model gets trained on more of the data, so it makes sense that the performance may increase slightly up to 100 folds. What confuses me is what happens afterwards. I agree it doesn't quite make sense to use 150 folds, since I'm not sure how one divides 200 examples into 150 folds - I assume there may be some repetition in the training sets? Still, I'd expect a warning telling me this is potentially problematic, and I don't understand why the correlation value collapses. And finally, at 200 folds, which is equivalent to a leave-one-out CV, the correlation is exactly 0.
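
For reference, this is how I understand the micro average reported alongside each result: pool every out-of-fold prediction and compute a single correlation over all 200 examples, instead of averaging the per-fold correlations. That interpretation is an assumption on my part, not something I've confirmed in the documentation; the sketch below just continues with the same stand-in data as above.

```python
# My understanding of the micro average (an assumption, not confirmed in the docs):
# pool every out-of-fold prediction and compute one correlation over all 200 examples.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 3))                                # same stand-in data as above
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=2.0, size=200)

for n_folds in (5, 10, 20, 50, 100, 150, 200):
    pooled_pred = cross_val_predict(
        LinearRegression(), X, y,
        cv=KFold(n_splits=n_folds, shuffle=True, random_state=0))
    micro_corr, _ = pearsonr(y, pooled_pred)                         # one correlation over all predictions
    print(f"{n_folds:3d}-fold micro-averaged correlation: {micro_corr:.3f}")
```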
So does that mean the "best" number of folds in this case is half the number of examples in the dataset? If so, why is that? Or should I only rely on the micro averages, which are pretty stable?