Hey @Abi,
Scoring is typically done in real time rather than in batch. I assume you mean the ratio of the train, dev/hold-out, and test sets. The rule of thumb is: if the number of rows is less than 100k, it could be 60%/20%/20% or 70%/15%/15%. But if you have 1 million or more rows, it could be 98%/1%/1% or even 99.5%/0.4%/0.1%.
As far as reducing the total rows goes, a common trick is to retrain the model on the whole data set after you have validated the final model.
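For what it's worth, here is a minimal sketch of a 60/20/20 split using scikit-learn. The DataFrame `df`, the file name, and the column name "label" are hypothetical placeholders, not anything from this thread:

```python
# Minimal sketch of a 60/20/20 train/dev/test split with scikit-learn.
# `df`, "data.csv" and the "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
X, y = df.drop(columns="label"), df["label"]

# First split off 40% of the rows, then cut that 40% in half,
# giving 60% train, 20% dev/hold-out, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest)

print(len(X_train), len(X_dev), len(X_test))
```

For larger ratios (98/1/1 or finer) you would just change the two `test_size` values accordingly.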
@hbajpai,
Scoring is typically done in real time rather than in batch.
I would challenge you on this. In customer analytics it's often fine to do scoring once a day or once a week.
Best,
Martin
Rather than a fixed ratio, the ideal approach is to use cross-validation.
There is a reason this is considered the "gold standard" for validation. This approach ensures that 100% of the data is used in both training and testing. Otherwise you are inviting bias from random effects of which records are in your training set vs your testing set.
I understand the reasons why AutoModel has chosen to implement a form of split validation, which is primarily to save processing time. That is probably a smart choice for an automated tool like that, which is designed to work on pretty much any size of data set that users might choose to use with it. It is also potentially doing a lot of other complicated things, like feature engineering and feature selection, so some corners have to be cut to make the best use of the overall time that users are willing to wait for the output.
However, if you are doing your own process manually and can set it up any way you like, then your default should probably be to do cross-validation and only deviate from that when you have a specific need. If you have tons of data and you are also doing many other complicated things, then perhaps it is better to do split validation. But if you have smaller data sets, or more time you can devote to preprocessing and model building, then cross-validation is really the way to go.
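If anyone wants to try this outside of AutoModel, here is a minimal sketch of k-fold cross-validation with scikit-learn. The feature/label arrays `X`, `y` and the choice of classifier are hypothetical, just to illustrate the idea:

```python
# Minimal sketch of 10-fold cross-validation with scikit-learn.
# X, y are hypothetical feature/label arrays; the classifier is only an example.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=42)

# Every row is used for training in k-1 folds and for testing in exactly one fold,
# so the performance estimate is not tied to a single random train/test split.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```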

Totally agree with @Telcontar120 on CV. If one cannot afford to implement CV due to time constraints, huge data, or specific needs, then another validation scheme similar to AutoModel's can be used.
Hello @Abi
70-30 is a general ratio that you find in many processes where split validation is used. I really like the validation used in Auto Model. What Auto Model does is train a model on 60% of the data and then score it on the remaining 40%. It scores that 40% by splitting it into 7 subsets, testing on each subset, and then averaging the performance across those 7 subsets. This way it also gets some of the advantages of cross-validation by splitting into subsets.
My suggestion: go with 60% training (cross-validated) and 40% testing (divided into 7 or 5 subsets) for scoring. If you can cross-validate on the whole data, that is fine as well, but test the model on at least a 10% hold-out set after CV.
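Here is a minimal sketch of that 60/40 scheme in scikit-learn, with the 7-subset averaging emulated via numpy.array_split. As above, `X`, `y` and the classifier are hypothetical placeholders, not Auto Model's actual internals:

```python
# Minimal sketch of the 60/40 scheme described above: train on 60%,
# split the 40% hold-out into 7 subsets, score each, and average.
# X, y (pandas objects) and the classifier are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.40, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Score each of the 7 hold-out subsets separately, then average.
subset_scores = []
for idx in np.array_split(np.arange(len(X_hold)), 7):
    subset_scores.append(
        accuracy_score(y_hold.iloc[idx], model.predict(X_hold.iloc[idx])))
print(np.mean(subset_scores), np.std(subset_scores))
```

The per-subset standard deviation gives a rough feel for how stable the hold-out estimate is, which is part of the appeal of this approach over a single lump-sum score.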