Auto Model Rows

Madcap (New Altair Community Member)
edited November 5 in Community Q&A
Hi, I am currently trying to use Auto Model with a data set of roughly 1300 rows.
When I load the data the row count shows 1300; it is also 1300 in Select Task and in Prepare Target. However, when I get the results, choose a certain model, and go into Predictions, I can only see scoring for around 520 rows.

Is there any reason why about half of the rows are missing or not being displayed? I wondered if it had something to do with editing the model types. Currently I am just using the default settings, e.g. Use regularisation and Automatically optimise.

I am currently using an academic license. I checked whether it was a row limit, but I have unlimited rows, which makes sense: when I build the models manually I get results for all 1300 rows.

Thanks for any help you can offer.
-Jason 

Answers

  • IngoRM (New Altair Community Member)
    Answer ✓
    Glad to hear from you.  That behavior is actually what is supposed to happen: we create a 40% hold-out set from your input data to evaluate the models, and those are the roughly 520 rows you see.  Predictions are created for that hold-out set to calculate how well the models work.  See this discussion for more details: https://community.rapidminer.com/discussion/54301/auto-model-issue
    There is really no point in scoring the 60% of the data the model was trained on, by the way.  For more on this, I would recommend this white paper: https://rapidminer.com/resource/correct-model-validation/
    Hope this helps,
    Ingo
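    For intuition, here is a minimal Python sketch of the same hold-out idea, using scikit-learn as a stand-in (this is not Auto Model's internal code; the file name, column name, and model choice are illustrative):

        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        # Hypothetical ~1300-row data set; "label" stands in for the target column.
        data = pd.read_csv("my_data.csv")
        X, y = data.drop(columns="label"), data["label"]

        # 60/40 split: the model never sees the 40% hold-out during training.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.40, random_state=42
        )

        model = DecisionTreeClassifier().fit(X_train, y_train)

        # Predictions are produced only for the hold-out (~520 of 1300 rows),
        # because performance must be measured on data the model has not seen.
        predictions = model.predict(X_test)
        print(len(X_test), "rows scored, accuracy:",
              accuracy_score(y_test, predictions))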
  • Madcap (New Altair Community Member)
    Thanks, that makes sense.

    Just one final thing, if that is okay: which results should I use, then? The manual decision tree (with cross-validation), which takes all the rows into account, or Auto Model, which scores on the 40% hold-out? The numbers are very similar, maybe only a 1-2% difference, with Auto Model showing the higher accuracy.

    Thanks again
    -Jason
  • Telcontar120 (New Altair Community Member)
    Answer ✓
    I would argue that in all cases cross-validation is a better performance indicator (in line with the white paper Ingo references above).  Any split validation sample is always going to be subject to the idiosyncrasies of only a subset of the data and how it differs from the overall sample.  It is true that in larger data sets this effect should diminish in magnitude, but cross-validation sidesteps it by testing on every row exactly once.
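    As a rough illustration, here is a small Python sketch (scikit-learn, with synthetic stand-in data) of how a single split's score depends on which rows land in the hold-out, while the cross-validated average uses every row for testing exactly once:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score, train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        # Synthetic stand-in for a data set of about 1300 rows.
        X, y = make_classification(n_samples=1300, random_state=0)

        # Several 60/40 split validations: each score reflects the
        # idiosyncrasies of the particular hold-out it was drawn on.
        for seed in range(3):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.40, random_state=seed
            )
            model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
            print("split seed", seed, "accuracy:",
                  accuracy_score(y_te, model.predict(X_te)))

        # 10-fold cross-validation: every row serves in a test fold once,
        # so the averaged score does not hinge on a single split.
        scores = cross_val_score(
            DecisionTreeClassifier(random_state=0), X, y, cv=10
        )
        print("10-fold CV mean accuracy:", scores.mean())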
  • Madcap (New Altair Community Member)
    Thanks for your help, guys.
    I will take the cross-validation reading then. I am actually looking into RapidMiner for my honours project (dissertation), so all of this advice is really helpful and gives me more to write about!

    Thanks
    -Jason 
  • Telcontar120 (New Altair Community Member)
    Yes, consistent with my comments above, I would report the performance results from the cross-validation.