What model should I use (training, validation, or testing)?
cliftonarms
New Altair Community Member
I am seeking some best-practice advice on applying a model for live prediction, as I am a little confused about which approach is normally adopted.
The data: My data set has 50 attributes and 3,400 rows (90% for training, 10% for unseen testing), with the very last row reserved as the live prediction example.
The training: I use the 90% training data in 10-fold X-Validation to find the best training algorithm and attribute mix for my data, then confirm the selected setup by applying the resulting model to the 10% of unseen data.
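To make my setup concrete: the selection step looks roughly like the sketch below, written in scikit-learn terms rather than as my actual RapidMiner process (the file name, the "label" column, and the two candidate learners are placeholders).

    # Rough sketch of the selection step (placeholder names, not the real process).
    import pandas as pd
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    data = pd.read_csv("my_data.csv")      # 3400 rows, 50 attributes plus a label
    live_row = data.iloc[[-1]]             # very last row reserved for live prediction
    rest = data.iloc[:-1]

    X = rest.drop(columns="label")
    y = rest["label"]

    # 90% training / 10% unseen testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=42, stratify=y)

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    }

    # 10-fold X-Validation on the training data to compare the candidates
    for name, model in candidates.items():
        scores = cross_val_score(model, X_train, y_train, cv=10)
        print(name, scores.mean())

    # confirm the chosen setup on the 10% unseen data
    best = RandomForestClassifier(n_estimators=200, random_state=42)
    best.fit(X_train, y_train)
    print("unseen test accuracy:", best.score(X_test, y_test))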
My question is: once I am happy with the above results, what model do I use (or create) for the live prediction of the last row?
1) Do I use the best model created via the 10-fold X-Validation on the 90% training data?
2) Do I create a model from the 90% training data (without folds), using the best settings found during the X-Validation training?
3) Do I create a model on 100% of the data (90% training plus 10% unseen), with the best settings found from training?
Thank you in advance for your time.
Answers
-
With datasets that small, my advice would be to go with (1): select based on the X-Validation.
With a large dataset you could go with (2): select based on the training/test split. You can do without X-Validation there.
Whatever you pick, don't ever do (3), as you run the risk of badly over-fitting the data.
There are some authors who recommend splitting the dataset into training/test/validation: train your models on the training set, compare the models on the test set, pick the best, and then estimate the error rate of the best model again on the validation set.
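A toy illustration of that idea (generic scikit-learn on synthetic data, not RapidMiner):

    # Train the candidates on the training set, compare them on the test set,
    # then estimate the error of the winner once on the untouched validation set.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=3400, n_features=50, random_state=0)

    # 60% training, then split the remaining 40% evenly into test and validation
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.40, random_state=0)
    X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

    candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
    best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_test, y_test))

    # the validation set is touched only once, by the chosen model
    print("estimated error rate:", 1 - best.score(X_val, y_val))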
-
Thanks for the quick response, earmijo.
Can I just confirm: you are advocating using the "best" model created by the 10-fold X-Validation method, and not retraining a model with the "best" settings on the complete data set?
-
The way X-Validation works in RapidMiner is that you use the X-Validation to estimate the "out-of-sample" error, but you report the model trained on the entire dataset. Notice, for instance, that when you use 10-fold X-Validation the model is estimated 11 times: once per fold, plus a final time on all of the data.
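In generic code terms (a scikit-learn sketch of the same idea, not RapidMiner internals), it amounts to this:

    # Cross-validation only estimates the out-of-sample error; the model you
    # actually deploy is then fitted once more on the entire training data.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=3400, n_features=50, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0)

    # 10 fits, one per fold: this only produces the performance estimate
    scores = cross_val_score(model, X, y, cv=10)
    print("estimated accuracy:", scores.mean())

    # the 11th fit: train on all of the data and keep this model for live prediction
    model.fit(X, y)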
-
Fantastic - I understand - thank you for your advice.
-
Hi,
I have a question. Do you apply the trained model with the Apply Model operator right after the X-Validation, or do you have to train again over the whole training set once the X-Validation has run? I am asking because, in the case of a feature selection with an inner X-Validation, you don't get a model out of the feature selection (there is no connection point for it). You could save the model with a Remember operator inside the feature selection, recall it outside that operator, and combine it with the feature weights for the unseen test set. But I think one has to retrain over the full training set with the selected features, right?
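To show what I mean, here is a scikit-learn sketch of that last point (the selector and learner are stand-ins, not the actual RapidMiner operators):

    # The feature-selection loop (with its inner cross-validation) only decides
    # which attributes to keep; the model applied to the unseen test set is then
    # retrained on the full training data restricted to those attributes.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=3400, n_features=50, n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

    # recursive feature elimination with an inner 10-fold cross-validation
    selector = RFECV(LogisticRegression(max_iter=1000), cv=10)
    selector.fit(X_train, y_train)

    # retrain over the full training set using only the selected features
    final_model = LogisticRegression(max_iter=1000)
    final_model.fit(X_train[:, selector.support_], y_train)

    print("unseen test accuracy:",
          final_model.score(X_test[:, selector.support_], y_test))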