Downsampling operators

20160041
New Altair Community Member
Hi,
Could you please tell me how I can achieve downsampling with imbalanced data in RM? I have used the random sampling and bootstrap sampling operators and would also like to know the difference between the two.
Thank you
Best Answers
-
Hi,
In the Mannheim Toolbox extension, there is a Sample - Balance operator that does just this.
(Opinions on fundamental techniques aside, you might want to work with weighting instead of sampling.)
All the best,
Rodrigo.
-
Hi @20160041,
The different Sample operators also give you the possibility to downsample (or upsample) your data set. The Sample operator just randomly draws (without replacement) a given number of Examples. By default, which Examples are drawn does not depend on the class, so the (possibly imbalanced) class ratio will stay the same (up to some random variation) after drawing. You can change this by selecting 'balance data' and drawing different numbers of Examples per class. If you want to force your ratio to 1.0, you can set the sample size for both classes to the same number.
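For illustration, here is a minimal pandas sketch of what such a balanced downsample (drawing without replacement, a fixed number of Examples per class) amounts to conceptually. This is not RapidMiner's implementation; the function name and the toy 'label' column are made up for the example:

```python
import pandas as pd

def downsample_balanced(df: pd.DataFrame, label: str, n_per_class: int,
                        seed: int = 42) -> pd.DataFrame:
    """Draw up to n_per_class rows per class, without replacement."""
    parts = [
        grp.sample(n=min(n_per_class, len(grp)), replace=False,
                   random_state=seed)
        for _, grp in df.groupby(label)
    ]
    # Shuffle so classes are interleaved rather than blocked together.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy usage: 7 "neg" rows, 3 "pos" rows -> 3 of each after sampling.
toy = pd.DataFrame({"x": range(10),
                    "label": ["neg"] * 7 + ["pos"] * 3})
balanced = downsample_balanced(toy, "label", n_per_class=3)
```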
Sample (Stratified) will always sample in such a way that the class ratio is kept.
Sample (Bootstrapping) draws with replacement, so a specific Example can occur multiple times after sampling. This can be helpful to upsample a class of which you only have a small number of Examples.
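The with/without replacement distinction is easy to see in a small sketch (again plain pandas with made-up toy data, not the operator itself):

```python
import pandas as pd

# Toy data: the "pos" class is rare (3 of 10 rows).
df = pd.DataFrame({"x": range(10),
                   "label": ["neg"] * 7 + ["pos"] * 3})
minority = df[df["label"] == "pos"]

# Bootstrap: drawing WITH replacement, so n may exceed the class size
# and individual rows can occur multiple times in the result.
upsampled = minority.sample(n=7, replace=True, random_state=0)

# The same call with replace=False raises a ValueError, because you
# cannot draw 7 distinct rows from a class that only has 3.
```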
Hope this helps with the differences between the sampling operators.
Two other things I would like to mention:
In most cases I would try not to downsample your data for a machine learning task: you remove information which your model could use for finding patterns. You may want to switch to another model instead. There are a few good reasons for downsampling:
- Runtime problems
- If you have an extremely large number of Examples for one class (say a class ratio of 20:1 or higher)
If you want to get rid of your imbalanced class ratio, you may also want to try the SMOTE operator from the Operator Toolbox Extension. It performs an (advanced) method for upsampling your underrepresented class.
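Outside of RapidMiner, the same idea is available in the third-party Python package imbalanced-learn; a hedged sketch, assuming that package is installed and using purely synthetic toy data:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data (about 9:1), for demonstration only.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# SMOTE creates new minority Examples by interpolating between existing
# minority-class neighbours, rather than duplicating rows as bootstrapping does.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
```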
Best regards
Fabian
-
I second the idea that weighting is the preferred approach, and that downsampling should be used primarily when you have many more cases than needed (either in general, or specifically of the majority class). There are diminishing returns to larger and larger samples, so if your development population is hundreds of thousands of cases then you likely don't need them all. But if you have an absolutely small number of your minority class then you probably don't want to downsample the majority class to match it, as too much information would be lost.
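For readers who want to see the weighting alternative in code form, here is a small scikit-learn sketch (toy data and one possible learner chosen for illustration; many other learners also accept class weights):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (about 19:1), for demonstration only.
X, y = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency,
# so no rows are discarded (as in downsampling) or duplicated (as in
# upsampling); the learner simply pays more attention to the rare class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```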