What about the n models generated in cross-validation? Shouldn't we take the average of all models (linear regression)?
I have a question regarding cross-validation with a linear regression model.
From my understanding, in cross-validation we split the data into (say) 10 folds, train on 9 of them, and use the remaining fold for testing. We repeat this process until every fold has been used for testing exactly once.
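Roughly like this, if I sketch my understanding in Python with scikit-learn on some made-up data (just to illustrate what I mean, since my actual process is in RapidMiner):

```python
# A sketch of my understanding: 10-fold cross-validation of a linear regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Each fold trains its own model on 9/10 of the data ...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ... and is tested on the remaining 1/10.
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# The averaged performance over all 10 folds - this is the number I see reported.
print(sum(fold_errors) / len(fold_errors))
```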
When we train the model on 9 folds, shouldn't we get a different model each time (perhaps slightly different from the model we created using the whole dataset)? I know that we take an average of all n performances, and I can see that clearly when I use the "Write as Text" operator.
But what about the model? Shouldn't the resulting model also be the average of all n models? I see that the resulting model is the same as the model we created on the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and don't average the n models), then what is the point of calculating the average performance of n different models? They are trained on different data and are supposed to be different, right?
I apologize if my question is not clear or too funny.
Thanks for reading, though!
Best Answer
-
Hi,
This is not a funny question at all - in fact, I would go so far as to say it is probably one of the most frequently asked questions in machine learning I have heard in my life. :)
Let me get straight to the point here: Cross-validation is not about model building at all. It is a common scheme to estimate (not calculate! - a subtle but important difference) how well a given model will work on unseen data. The fact that we deliver a model at the end (for convenience) might lead you to the conclusion that it is actually about model building as well - but this is just not the case.
Ok, here is why this validation is an approximation of an estimation for a given model only: typically you want to use as much data as possible since labeled data is expensive and in most cases the learning curves show you that more data leads to better models. So you build your model on the complete data set since you hope this is the best model you can get. Brilliant! This is the given model from above. You could now gamble and use this model in practice, hoping for the best. Or you want to know in advance if this model is really good before you use it in practice. I prefer the latter approach ;-)
So only now (actually kind of after you built the model on all data) you are of course also interested in learning how well this model works in practice on unseen data. Well, the closest estimate you could get comes from a so-called leave-one-out validation, where you use all but one data point for training and the one you left out for testing. You repeat this for all data points. This way, the models you build are "closest" to the one you are actually interested in (since only one example is missing), but unfortunately this approach is not feasible for most real-world scenarios, since you would need to build 1,000,000 models for a data set with 1,000,000 examples.
Here is where cross-validation enters the stage. It is just a more feasible approximation of something which already was only an estimation to begin with (since we omitted one example even in the LOO case). But this is still better than nothing. The important thing is: it is a performance estimation for the original model (built on all data), and not a tool for model selection. If anything, you could misuse cross-validation as a tool for example selection, but I won't go into that discussion now.
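To put rough numbers on the feasibility argument (a small sketch in Python with scikit-learn, purely illustrative): leave-one-out has to train one model per example, whereas 10-fold cross-validation always trains exactly 10 models, no matter how large the data set is.

```python
# Why leave-one-out does not scale but 10-fold does (illustrative only).
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((1_000_000, 5))  # pretend we had one million labeled examples

# Number of models each scheme would have to train:
print(LeaveOneOut().get_n_splits(X))       # 1000000 - one model per example, not feasible
print(KFold(n_splits=10).get_n_splits(X))  # 10 - feasible regardless of data set size
```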
Besides this: you might have an idea how to average 10 linear regression models - but what do we do with 10 neural networks with different optimized network structures? Or 10 different decision trees? How do you average those? In general, this problem cannot be solved anyway.
You might enjoy reading this older discussion where I spend more time discussing the different options besides averaging: http://community.rapidminer.com/t5/RapidMiner-Studio/Interpretation-of-X-Validation/m-p/9204
The net is: none of them is a good idea, and you should do the right thing instead - which is to build one model on as much data as you can and use cross-validation to estimate how well this model will perform on new data.
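If you prefer seeing that workflow as code, here is a minimal sketch in Python with scikit-learn (not RapidMiner, and on made-up data): the cross-validation call only delivers the performance estimate, and the model you actually use is trained once on all the data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: estimate how well this kind of model will perform on unseen data.
estimated_scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
print("estimated R^2 on new data:", estimated_scores.mean())

# Step 2: the model you actually use is built once, on ALL the data.
final_model = LinearRegression().fit(X, y)
```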
Hope that clarifies this,
Ingo
Answers
-
Of course I second everything Ingo said, but I would like to add one more punch line:
(Cross-)Validation is not about validating a model but about validating the method to generate a model.
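To make that concrete (again a sketch in Python with scikit-learn, made-up data, just to illustrate the point): what you hand to cross-validation is an unfitted "recipe" for building a model; the procedure refits a fresh copy of it on every training fold, so it validates the method rather than any single fitted model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The "method": scale the features, then fit a linear regression.
method = make_pipeline(StandardScaler(), LinearRegression())

# Cross-validation scores the method: it refits a fresh clone of it on every training fold.
print("estimated performance of the method:", cross_val_score(method, X, y, cv=10).mean())

# The model you deploy is the method applied once to all available data.
deployed_model = method.fit(X, y)
```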
Best,
Martin
-
Dear Ingo,
Thank you so much, Ingo. This is probably the best explanation I have ever had. Now it all makes sense to me.
You really made my day!
Binay
-
Thanks Schmitz,
Yes, that totally makes sense to me now.
Binay