Other ways to Validate results

dome New Altair Community Member
edited November 2024 in Community Q&A
Hello,

I have a data set of 84 rows and 400 attributes for a classification problem. I prepared the data so that I can apply a decision tree or other tree-based models. To evaluate and test the model I use the Performance operator, especially the accuracy. I split the data in a ratio of 80/20: 80% is the training set and 20% the test set.

The result of this model is an accuracy of 80%. When I change the split type, for example from stratified to shuffled, or the ratio from 80/20 to 70/30, the accuracy drops to 60%. Now my question:

Is this phenomenon normal? Is there any other way to validate a classification model? And, probably a bad question that can only be answered by seeing the process: why does the model accuracy vary so drastically just from changing the split ratio or split type?

Thanks a lot!

Answers

  • varunm1
    varunm1 New Altair Community Member
    edited July 2019 Answer ✓
    Hello @dome

    Yes, that can happen. Accuracy depends on which examples end up in the test set, so when the test set changes, the accuracy changes too. With only 84 rows, a 20% test set holds about 17 examples, so each additional misclassified example moves the accuracy by roughly 6 percentage points. This is why we recommend the Cross Validation operator: it splits the data into N folds, trains on N-1 folds, and tests on the left-out fold, repeating until every fold has served as the test set once. The averaged result is a more reliable performance estimate. As your data set is small, I recommend using either 3 or 5 folds.

    Here is a detailed thread on the working of cross-validation.

    https://community.rapidminer.com/discussion/54621/cross-validation-and-its-outputs-in-rm-studio#latest

    Hope this helps. Please let me know if you need more info.
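The fold procedure described in the answer above (train on N-1 folds, test on the left-out fold, repeat for every fold) can be sketched in plain Python. This is only a minimal illustration, not how RapidMiner implements its operator: the function names, the toy data, and the majority-class stand-in learner are all assumptions made up for the example.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the example indices and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, k, train_fn, predict_fn, seed=0):
    """Train on k-1 folds, test on the left-out fold, and repeat for every fold."""
    folds = k_fold_indices(len(examples), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical stand-in learner: always predicts the most frequent training label.
def train_majority(train_examples, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, example):
    return model


# Toy data with the same number of rows as in the question (84); values are made up.
examples = [[i] for i in range(84)]
labels = ["A" if i % 3 else "B" for i in range(84)]

fold_accuracies = cross_validate(examples, labels, 5, train_majority, predict_majority)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Every example is used for testing exactly once, and the per-fold accuracies show how much the score depends on which examples are held out. Reporting their mean (and ideally their spread) is what makes the estimate more stable than a single 80/20 split.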
  • dome
    dome New Altair Community Member
    Hi,

    Yes, that helps a lot. Thanks!

    Another question:
    I know the difference between stratified and shuffled sampling. Which one do I use when? Which should I use in my case, and why?

    Thank you!
  • varunm1
    varunm1 New Altair Community Member
    Answer ✓
    Hello @dome

    Here is when I use stratified or shuffled sampling.

    Stratified: When my classes are highly imbalanced and I want the same proportion of classes in all my folds. For example, say I have a data set of 100 examples, 80 of which belong to Class A and 20 to Class B. If I use stratified sampling with 5 folds, each fold will have 16 Class A and 4 Class B samples.

    Shuffled sampling: This randomly shuffles your examples and, in the same example, divides them into folds of 20 each. There won't be any class balancing in the folds.

    Now, why stratified and not shuffled?

    Sometimes shuffled sampling creates a fold with examples of only one class. To avoid this, we use stratified sampling.

    Hope this helps
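The difference between the two sampling types in the last answer can be checked with a small plain-Python sketch using the same numbers (100 examples, 80 in Class A, 20 in Class B, 5 folds). The helper names `shuffled_folds` and `stratified_folds` are hypothetical and only illustrate the idea; they are not RapidMiner functions.

```python
import random
from collections import Counter, defaultdict


def shuffled_folds(labels, k, seed=0):
    """Shuffle all indices together, then deal them into k folds."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_folds(labels, k, seed=0):
    """Shuffle within each class, then deal each class evenly across the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds


# The example from the answer: 80 examples of Class A and 20 of Class B.
labels = ["A"] * 80 + ["B"] * 20

for name, make_folds in [("shuffled", shuffled_folds), ("stratified", stratified_folds)]:
    print(name)
    for fold in make_folds(labels, 5):
        # Class counts per fold
        print("  ", dict(Counter(labels[i] for i in fold)))
```

The stratified folds each contain 16 Class A and 4 Class B examples, while the class counts in the shuffled folds depend on the random seed. With a small or strongly imbalanced data set, that variation is one reason a shuffled split can score noticeably differently from a stratified one.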