Hi,
I have some problems understanding how the decision tree algorithm works in combination with cross-validation. Another user on Stack Overflow apparently had the very same question, which I could not have put in a better way, so I apologize for simply copy-pasting it:
http://stackoverflow.com/questions/2314850/help-understanding-cross-validation-and-decision-trees

[quote author=chubbard]
I've been reading up on Decision Trees and Cross Validation, and I understand both concepts. However, I'm having trouble understanding Cross Validation as it pertains to Decision Trees. Essentially Cross Validation allows you to alternate between training and testing when your dataset is relatively small to maximize your error estimation. A very simple algorithm goes something like this:
1. Decide on the number of folds you want (k)
2. Subdivide your dataset into k folds
3. Use k-1 of the folds as a training set to build a tree.
4. Use the held-out fold as a test set to estimate the error of your tree.
5. Save your results for later
6. Repeat steps 3-5 k times, leaving out a different fold for your test set each time.
7. Average the errors across your iterations to estimate the overall error.
The problem I can't figure out is that at the end you'll have k decision trees that could all be slightly different, because they might not split on the same attributes, etc. Which tree do you pick? One idea I had was to pick the one with minimal error (although that doesn't make it optimal, just that it performed best on the fold it was given; maybe using stratification will help, but everything I've read says it only helps a little bit).
As I understand cross-validation, the point is to compute in-node statistics that can later be used for pruning. So really, each node in the tree will have statistics calculated for it based on the test set given to it. What's important are these in-node stats, but if you're only averaging your error, how do you merge these stats within each node across k trees when each tree could vary in what it chooses to split on, etc.?
What's the point of calculating the overall error across the iterations? That's not something that could be used during pruning.
[/quote]
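To be concrete about how I read the procedure chubbard describes, here is a small sketch in Python with scikit-learn (which I use only as a stand-in for the operators here; the Iris data, k=5, and all parameters are just placeholders, not recommendations):

[code]
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                        # placeholder data
kf = KFold(n_splits=5, shuffle=True, random_state=0)     # steps 1-2: choose k, make folds

fold_errors = []
for train_idx, test_idx in kf.split(X):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[train_idx], y[train_idx])                 # step 3: train on k-1 folds
    err = 1.0 - tree.score(X[test_idx], y[test_idx])     # step 4: error on held-out fold
    fold_errors.append(err)                              # step 5: save for later

# step 6 is the loop itself; step 7:
print("per-fold errors:", fold_errors)
print("estimated overall error:", np.mean(fold_errors))
[/code]

If that sketch is right, the loop produces k different trees plus one averaged error number, which is exactly where my confusion starts.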
The posted answers don't really answer the question for me. First of all, to my understanding, one should not use the same data for training and testing, so using 100% of your data for training and then testing the model on that same data is a no-go. Secondly, when you put a decision tree learner in the left (training) part of a cross-validation operator, it should indeed create a (possibly different) model in each iteration. So the question remains: which tree is chosen in the end (the one you see when you choose to output the model)? Or is this in fact some kind of averaged model?
Or is the idea of X-Validation not to actually create a model, but rather to see how a certain learner with certain parameters would perform on your data? In that case, however, I don't understand how you would build the actual model: if you use 100% of your data for training, you have no test data left for post-pruning and the prevention of overfitting ...
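If that interpretation is correct, I would expect the workflow to look roughly like the following sketch (again scikit-learn syntax as an assumed stand-in; max_depth=3 is an arbitrary parameter setting chosen just for illustration):

[code]
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                        # placeholder data
learner = DecisionTreeClassifier(max_depth=3, random_state=0)

# Cross-validation only scores this learner/parameter combination;
# the k models built internally are thrown away afterwards.
scores = cross_val_score(learner, X, y, cv=5)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# The model you actually deliver would then be trained on all the data.
final_model = learner.fit(X, y)
[/code]

But then the final fit uses every example for training, which brings me back to the question above: where does the data for post-pruning come from?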
Thank you very much for shedding some light on this topic, and best regards,
Hanspeter