Other ways to Validate results

Hello,

I have a dataset of 84 rows and 400 attributes for a classification problem. I prepared the data so that I can apply a decision tree or other tree-based models. To evaluate and test the model, I use the Performance operator, especially the accuracy. I split the data in a ratio of 80/20: 80% is the training set and 20% is the test set.

The result of this model is an accuracy of 80%. When I change the split type, for example from stratified to shuffled, or the ratio from 80/20 to 70/30, the accuracy drops to 60%. Now my question:

Is this phenomenon normal? Is there any other way to validate a classification model? And, probably a bad question that can only be answered by seeing the process: why does the model accuracy vary so drastically with just the split ratio or split type?

Thanks a lot!

Accepted answers

varunm1

Hello @dome

Yes, this can happen. Accuracy depends on the test data, so if the test data changes, the accuracy changes. This is why we recommend the Cross Validation operator: it splits the data into N folds, trains on N-1 folds, and tests on the remaining fold, repeating until every fold has served as the test set once. Averaging the results over all folds gives you a more reliable performance estimate. As your data set is small, I recommend using either 3 or 5 folds in CV.
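To make the fold mechanics concrete, here is a minimal sketch in plain Python. It is not how the Cross Validation operator is implemented; the helper names (`k_fold_indices`, `cross_validate`, `train_majority`, `predict_majority`), the toy labels, and the majority-class "learner" are all assumptions chosen only to keep the example self-contained.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle indices 0..n_examples-1 and deal them into k near-equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, k, train_fn, predict_fn, seed=0):
    """Train on k-1 folds, test on the held-out fold, once per fold."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies

# Hypothetical stand-in learner: always predicts the majority training class.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

# Toy data with 84 rows (the class mix is invented for illustration only).
labels = ["A"] * 50 + ["B"] * 34
features = [[0.0] for _ in labels]

fold_accuracies = cross_validate(features, labels, 5,
                                 train_majority, predict_majority)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

The spread between the per-fold accuracies is the same effect seen when changing a single split. With 84 rows, a 20% test set holds about 17 examples, so each individual misclassification moves the accuracy by roughly 6 percentage points.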

Here is a detailed thread on the working of cross-validation.

https://community.rapidminer.com/discussion/54621/cross-validation-and-its-outputs-in-rm-studio#latest

Hope this helps. Please let me know if you need more info.

varunm1

Hello @dome

Here is when I use stratified or shuffled sampling.

Stratified: I use this when my classes are highly imbalanced and I want the same proportion of classes in all my folds. For example, say I have a data set of 100 examples, with 80 belonging to Class A and 20 to Class B. If I use stratified sampling with 5 folds, each fold will have 16 Class A and 4 Class B samples.

Shuffled Sampling: This will randomly shuffle your examples and divide them into folds of 20 each; there won't be any class balancing in the folds.

Now, why stratified and not shuffled?

Sometimes shuffled sampling will create a fold with examples of only one class, or with very few examples of the minority class. To avoid this, we use stratified sampling.
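The difference between the two sampling types can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `stratified_folds` and `shuffled_folds` are hypothetical helpers, not RapidMiner functions, and the data is the 80/20 toy example from above.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal each class round-robin into k folds so class ratios are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

def shuffled_folds(labels, k, seed=0):
    """Shuffle all examples together and cut into k folds, ignoring classes."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def class_counts(labels, folds):
    """Count how many examples of each class land in each fold."""
    return [Counter(labels[i] for i in fold) for fold in folds]

# Toy data: 80 examples of Class A and 20 of Class B.
labels = ["A"] * 80 + ["B"] * 20

strat_counts = class_counts(labels, stratified_folds(labels, 5))
shuf_counts = class_counts(labels, shuffled_folds(labels, 5))

print("stratified:", [dict(c) for c in strat_counts])
print("shuffled:  ", [dict(c) for c in shuf_counts])
```

In the stratified case every fold holds exactly 16 Class A and 4 Class B examples. In the shuffled case the folds still have 20 examples each, but the per-fold class counts depend on the random seed and can drift away from the 80/20 ratio, which matters more the smaller the data set is.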

Hope this helps

All comments

dome

Hi,

Yes, that helps a lot. Thanks!

Another question:

I know the difference between stratified and shuffled sampling. Which do I use when? And what should I use in my case, and why?

Thank you!
