Auto Model Rows
Madcap
New Altair Community Member
Hi, I am currently trying to use Auto Model with a data set that has roughly 1,300 rows.
When I load the data I can see the row count at 1,300; in Select Task it also shows 1,300 rows, and the same in Prepare Target. However, when I get the results and choose a certain model, then go into Predictions, I can only see scoring for around 520 rows.
Is there any reason that about half of the rows are missing or not being displayed? I wondered if it was something to do with editing the model types. Currently I am just using the default settings, e.g. Use regularisation, Automatically optimise.
I am currently using an academic license and I checked whether it was a row limit, but I have unlimited rows, which makes sense because when I build the models manually I get results for all 1,300 rows.
Thanks for any help you can offer.
-Jason
Answers
Hi @Madcap,
Glad to hear from you. That behavior is actually what is supposed to happen. We create a 40% hold-out set from your input data to evaluate the models, which happens to be those ~520 rows. Predictions are created for those rows to calculate how well the models work. See this discussion for more details: https://community.rapidminer.com/discussion/54301/auto-model-issue
There is really no point in doing this for the 60% of the data the model was trained on, by the way. For more on this, I would recommend this white paper: https://rapidminer.com/resource/correct-model-validation/
Hope this helps,
Ingo
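For illustration, here is a minimal sketch in Python with scikit-learn of the kind of 60/40 hold-out evaluation described above. This is not Auto Model's internal code, and the data frame and column names are placeholders; it just shows why, with 1,300 input rows, only the roughly 520 held-out rows get predictions.

```python
# Minimal sketch of a 60/40 hold-out evaluation, assuming a pandas DataFrame
# "data" with a label column "label" -- illustrative only, not Auto Model's code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def holdout_evaluation(data: pd.DataFrame, label: str = "label"):
    X, y = data.drop(columns=[label]), data[label]

    # 60% of the rows are used for training, 40% are held out for testing.
    # With 1,300 input rows this leaves about 520 rows in the test partition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42, stratify=y
    )

    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Predictions (the "scoring") exist only for the held-out 40%.
    predictions = model.predict(X_test)
    return accuracy_score(y_test, predictions), len(X_test)
```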
Thanks, that makes sense.
Just one final thing, if that is okay: which results would I be inclined to use then? The manual decision tree (with cross-validation), which takes all the rows into account, or Auto Model, which uses the 40% hold-out? The numbers are very similar, maybe only a 1%-2% difference, with Auto Model having higher accuracy.
Thanks again
-Jason
Maybe AutoModel should switch to cross-validation on smaller datasets.
The cross-validation is more accurate in this case. You get a higher number from AutoModel, but that doesn't mean the model is better; it just means it got lucky when tested on less data.
I would argue that in all cases cross-validation is a better performance indicator (in line with the whitepaper Ingo references above). Any split-validation sample is always going to be subject to the idiosyncrasies of only a subset of the data and how it differs from the overall sample. It is true that in larger datasets this should diminish in magnitude, but cross-validation eliminates it entirely.
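To make the contrast concrete, here is a minimal sketch in Python with scikit-learn (illustrative only, using a synthetic placeholder dataset rather than the RapidMiner process itself). It compares a single 60/40 split score against a 10-fold cross-validation estimate on the same data, showing how the single-split number depends on which rows happen to land in the hold-out.

```python
# Minimal sketch comparing a single hold-out split with 10-fold cross-validation.
# Purely illustrative; the dataset and model here are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1300, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Single 60/40 split: the score depends on which ~520 rows land in the hold-out.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)
split_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: every row is used for testing exactly once,
# and the averaged score is a more stable performance estimate.
cv_scores = cross_val_score(model, X, y, cv=10)
print(f"single split: {split_score:.3f}")
print(f"10-fold CV:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```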
Thanks for your help, guys.
I will take the cross-validation reading then. I am actually looking into RapidMiner for my honours project (dissertation), so all of this advice is really helpful and gives me more to write about!
Thanks
-Jason
Yes, consistent with my comments above, I would report the performance results from the cross-validation.