Auto Model and overfitting

dgarrard, New Altair Community Member
edited November 2024 in Community Q&A

I've been experimenting with Auto Model for Prediction and am generally happy with the concept and results.  

 

In the Auto Model process the sampling is set to 80/20. Is this sufficient to control potential overfitting? I am getting performance ranging from about 60% accuracy for Naive Bayes to 87% accuracy for GBT. I have fewer than 1000 rows of data and 20 attributes per data set. GBT is generating about 20 trees. (I would potentially be operationalising with hundreds of datasets and a dedicated model per dataset.)

 


Answers

  • sgenzer, Altair Employee

    hello @dgarrard - I think it is always prudent to be on alert for overfitting, regardless of whether you are using Auto Model or the "normal" RapidMiner methods. We all know that some models, such as neural networks, are prone to overfitting and should be used with caution, particularly on small data sets.

     

    My personal opinion is that the 80/20 split is widely used and, in general, a reasonable ratio; it should be sufficient to avoid overfitting when used in conjunction with methods such as cross-validation (which is the default in Auto Model).
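    The combination Scott describes — an 80/20 hold-out split plus cross-validation on the training partition — can be sketched as follows. This is a minimal illustration using scikit-learn as a stand-in for Auto Model's internals; the synthetic data, model parameters, and 5-fold CV are assumptions chosen to mirror the numbers in the question (roughly 1000 rows, 20 attributes, a GBT with 20 trees), not a reproduction of what Auto Model actually runs.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split, cross_val_score

    # Synthetic stand-in for the data set in the question: ~1000 rows, 20 attributes
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # 80/20 hold-out split, as in the Auto Model process
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # A GBT with ~20 trees, matching the question
    model = GradientBoostingClassifier(n_estimators=20, random_state=42)

    # Cross-validation on the training partition estimates generalisation error
    # without ever touching the hold-out set
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)

    # Final sanity check on the untouched 20% hold-out
    model.fit(X_train, y_train)
    holdout_acc = model.score(X_test, y_test)

    print(f"CV accuracy: {cv_scores.mean():.3f}, hold-out accuracy: {holdout_acc:.3f}")
    ```

    If the cross-validation estimate and the hold-out accuracy are close, that is some reassurance that the model is not badly overfit; a large gap is the warning sign.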

     

    In the end, I always look at results with skepticism irrespective of the tool used until I actually inspect them to see how my "fit" looks on unseen data.

     

    Hope that helps.


    Scott

     

  • dgarrard, New Altair Community Member

    Thank you for the quick reply, Scott. I'll try to get some testing done in the next couple of weeks while my Auto Model trial is still available!

     

    David 

  • tkaiser, New Altair Community Member

    Hi, this is very helpful, thank you. But I do have a follow-up question: is Auto Model showing a testing-set accuracy or a training-set accuracy in the results view? I ran a GBT in Auto Model on 4,500 rows of data with 15 features and received an accuracy of 90% and an f-measure of 84%, but when I applied the model to new unseen data (which I had purposely held out from the training and cross-validation process), the accuracy declined to below 50%. So I am not sure whether I am running the validation process incorrectly, or perhaps not understanding what the results of the CV are telling me, as I had expected Auto Model to produce an accuracy rate reflective of how well the model will perform in the future. Thanks much.

  • IngoRM, New Altair Community Member

    Hi,

     

    sorry for the delay, I missed this one here. It shows the testing error, of course. If you read my correct validation opus linked above, you will see that we would NEVER care about training errors in the first place ;-)

     

    Such a drop can be caused either by a (significant) change in data distribution between the training and validation sets or, what I personally find more likely given the size of the drop, by not applying exactly the same data preparation to your validation set. More about this in the other thread here:

     

    https://community.rapidminer.com/t5/RapidMiner-Auto-Model-Turbo-Prep/Is-auto-model-showing-test-or-train-error/m-p/50902/highlight/false#M117

     

    Hope this helps,

    Ingo
