Resampling / oversampling with holdout sample.
kasper2304
New Altair Community Member
Hi guys.
I have a question regarding resampling / oversampling in combination with the use of a holdout sample.
My dataset is the following:
Positive cases: 337
Negative cases: 2661
What I have done so far:
1) Sample 337 positive cases and sample 1500 negative cases
2) Then I filter the 0's in one node and the 1's in another node
3) I use Sample (Bootstrapping) on the 1's with a factor of 4.451, giving me 1500 positive cases.
4) I append the datasets
5) I am ready to model
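For readers outside RapidMiner, the steps above can be sketched roughly in plain Python. This is only an illustration under assumptions: the function name `balance_by_bootstrap` and the placeholder case IDs are hypothetical, and bootstrapping is modelled as sampling with replacement, which is how I understand the bootstrapping operator to behave.

```python
import random

def balance_by_bootstrap(positives, negatives, n_negatives, seed=0):
    """Hypothetical helper: downsample negatives, bootstrap positives to match."""
    rng = random.Random(seed)
    # Step 1: sample the negatives without replacement.
    neg_sample = rng.sample(negatives, n_negatives)
    # Step 3: bootstrap the positives (sampling WITH replacement) up to the same size.
    pos_boot = rng.choices(positives, k=n_negatives)
    # Step 4: append both parts, labelled 1 (positive) and 0 (negative).
    combined = [(x, 1) for x in pos_boot] + [(x, 0) for x in neg_sample]
    rng.shuffle(combined)
    return combined

# Placeholder IDs standing in for the real text documents.
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

balanced = balance_by_bootstrap(positives, negatives, 1500)
factor = 1500 / 337  # the upsampling factor used in step 3
```

Here `factor` comes out at roughly 4.451, which matches the value in step 3. Note that every positive case appears several times in `balanced`, so any split made after this point can place copies of the same case on both sides.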
Now I want to use a holdout sample, as my linear SVM seems to be overfitting: it reports 90-95% accuracy.
What I consider the right thing is to extract, let's say, 37 positive cases and 37 negative cases to use for validation BEFORE upsampling the minority class. This leaves me with an evenly balanced holdout sample of 74 cases (I know it is small, but I am mining text so I need my training cases). It also leaves me with a training and test set of 300/1500, which I can upsample to 1500/1500 cases.
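A minimal sketch of that order of operations, again in plain Python rather than RapidMiner. The function name `split_then_upsample` and the placeholder IDs are hypothetical. The point of splitting first is that bootstrapped copies of a positive case can then never end up in both the training set and the holdout.

```python
import random

def split_then_upsample(positives, negatives, n_holdout, n_train_neg, seed=0):
    """Hypothetical helper: hold out first, then upsample only the training part."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    # 1) Set the holdout aside first, from the original (non-duplicated) cases.
    holdout = [(x, 1) for x in pos[:n_holdout]] + [(x, 0) for x in neg[:n_holdout]]

    # 2) Only the remaining cases are available for training.
    train_pos = pos[n_holdout:]
    train_neg = neg[n_holdout:n_holdout + n_train_neg]

    # 3) Upsample the minority class with replacement, inside the training set only.
    train_pos_up = rng.choices(train_pos, k=len(train_neg))
    train = [(x, 1) for x in train_pos_up] + [(x, 0) for x in train_neg]
    rng.shuffle(train)
    return train, holdout

# Placeholder IDs standing in for the real text documents.
positives = [f"pos_{i}" for i in range(337)]
negatives = [f"neg_{i}" for i in range(2661)]

train, holdout = split_then_upsample(positives, negatives, n_holdout=37, n_train_neg=1500)
```

With these numbers, `holdout` has 74 cases (37/37) and `train` has 3000 cases (1500/1500) built from 300 distinct positives. If the text features (for example a word list or TF-IDF weights) are learned from data, the same reasoning suggests fitting them on `train` only and then applying them unchanged to `holdout`.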
My SVM predicts almost all the negative cases correctly and about 2/3 of the positive cases when I apply feature extraction to the holdout sample.
What are your thoughts?
Are there other ways to use a holdout sample in RapidMiner?