How to run a prediction model on a dataset without splitting it into training and test datasets

Question

Hello everyone,
I hope this message finds you well. I am currently working on a project that involves running RapidMiner prediction models on a dataset. Specifically, I am interested in using tree induction, SVM, DM, and other models to predict outcomes and determine prediction accuracy. 
However, my dataset contains only 60 samples, which makes it difficult to split into training and test datasets. Does anyone have suggestions on how I can run the models and estimate their accuracy without splitting the dataset?
I greatly appreciate any insights or advice you may have on this matter.
Thank you,
Mansour

BalazsBaranyRM · Answer

Hi Mansour,

You could also look into leave-one-out validation. This is a cross-validation with as many folds as there are data rows, in your case 60.

The Cross Validation operator has a parameter for switching this on.

This approach takes the first example as the test set and trains on the remaining 59, then does the same with the second example, and so on. Each example is therefore tested with a model built on the rest of the data, and you get a robust estimate of the model quality.

A final model will be built on all data if you connect the model output of Cross Validation.
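
For readers who want to see the mechanics outside RapidMiner, here is a minimal Python sketch of leave-one-out validation followed by a final model trained on all rows. It is only an illustration, not how the Cross Validation operator is implemented: the nearest-centroid classifier, the function names, and the synthetic 60-row dataset are all assumptions made up for the example.

```python
import random


def train_centroids(rows):
    """Fit a toy nearest-centroid model: one mean vector per class label."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the given features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def leave_one_out_accuracy(rows):
    """Hold out each row once, train on the others, and score the held-out row."""
    correct = 0
    for i, (features, label) in enumerate(rows):
        training = rows[:i] + rows[i + 1:]  # every row except row i
        model = train_centroids(training)
        correct += predict(model, features) == label
    return correct / len(rows)


# Synthetic stand-in for a 60-row dataset (hypothetical data, two classes).
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "a") for _ in range(30)]
data += [([rng.gauss(3, 1), rng.gauss(3, 1)], "b") for _ in range(30)]

loo_accuracy = leave_one_out_accuracy(data)  # 60 train/test rounds
final_model = train_centroids(data)          # final model uses all 60 rows
print(f"leave-one-out accuracy: {loo_accuracy:.3f}")
```

The accuracy reported here describes how models of this kind perform on unseen rows; the final model itself is trained on everything and is the one you would deploy.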

Regards,
Balázs

RolandJones · Answer

Hi Mansour,

I would use cross-validation instead of a single train/test split. It is an iterative technique that divides the data into folds: each fold serves once as the test set while the model is trained on the remaining folds, so the entire dataset is used for both training and testing.

I'd suggest starting with 5 folds, given the dataset size you mentioned, and adjusting from there. Here’s a good blog post on cross-validation where you can get a little more information: https://rapidminer.com/blog/validate-models-cross-validation/. I believe we also have a video on the Academy.
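
To make the fold idea concrete, here is a small Python sketch of how 60 rows could be divided into 5 folds. It is only an illustration under assumptions: the helper name `k_fold_indices`, the shuffling, and the seed are made up for this example and do not describe RapidMiner's internal sampling.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the current test fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(60, 5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```

With 60 rows and 5 folds, each round trains on 48 rows and tests on 12, and every row is tested exactly once across the five rounds.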

Best,
Roland