Cross Validation with SMOTE Upsampling

Question

Hi all,
I see that there are already some discussions in this community about this subject. However, I still have some doubts.

I have a process with a class imbalance, and the minority class is the most important one. SMOTE upsampling seems to provide good results. I say "seems" because I am not sure how to validate it correctly.

My approach was to train the model on upsampled data and test it on a 20% holdout (partitioned before upsampling).

I guess this is the right thing to do because real data is not upsampled. But what is the most correct way to validate the model? I used the 20% holdout in the testing part of the Cross Validation operator (using the Remember and Recall operators).

What are your thoughts?
Please trash my approach if you think it deserves it :smile:

(I have enclosed a mock example data set and the RM process file.)

Thanks,
Pedro

Telcontar120 · Accepted Answer

Rather than using the holdout approach, I would recommend putting your SMOTE upsampling inside the cross-validation itself. The problem with your approach is that the results are highly dependent on the initial holdout sample, which is only drawn once. See the revised process attached. Your process won't actually run because you didn't set the label, but once you do that, you should be able to compare your original process to my revised version.
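
For readers without the attached process, the idea of upsampling inside the cross-validation can be sketched outside RapidMiner. The Python below is only an illustration under several assumptions, not the revised process itself: `smote_like` is a simplified SMOTE-style interpolation (not RapidMiner's operator), the classifier is a toy nearest-centroid model, and `make_data` generates made-up data with a 90/10 imbalance. All function names are hypothetical. The point it tries to show is that the synthetic examples are generated from, and added to, the training part of each fold only, so every test fold contains real examples and every example is tested once.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_like(minority, n_new, k=3, rng=None):
    """Simplified SMOTE-style oversampling: each synthetic point lies on the
    segment between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


def stratified_folds(labels, n_folds, rng):
    """Split indices into folds that keep roughly the same class ratio."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, c in enumerate(labels) if c == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def cross_validate_with_oversampling(X, y, minority_label=1, n_folds=5, seed=0):
    """Return the minority-class recall measured on each untouched test fold."""
    rng = random.Random(seed)
    recalls = []
    for test_idx in stratified_folds(y, n_folds, rng):
        test_set = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in test_set]
        train_y = [c for i, c in enumerate(y) if i not in test_set]

        # Oversample the TRAINING part only, until both classes are balanced.
        minority = [x for x, c in zip(train_X, train_y) if c == minority_label]
        n_new = max(len(train_y) - 2 * len(minority), 0)
        synth = smote_like(minority, n_new, rng=rng)
        train_X = train_X + synth
        train_y = train_y + [minority_label] * len(synth)

        # Toy nearest-centroid classifier fitted on the upsampled training data.
        centroids = {cls: centroid([x for x, c in zip(train_X, train_y) if c == cls])
                     for cls in set(train_y)}

        # Evaluate on real, non-upsampled test examples only.
        hits = [min(centroids, key=lambda cls: sq_dist(centroids[cls], X[i])) == minority_label
                for i in test_idx if y[i] == minority_label]
        recalls.append(sum(hits) / len(hits))
    return recalls


def make_data(seed=0):
    """Made-up 2-D data: 90 majority points near (0, 0), 10 minority points near (3, 3)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
    X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
    return X, [0] * 90 + [1] * 10


if __name__ == "__main__":
    X, y = make_data()
    print(cross_validate_with_oversampling(X, y))
```

Because the per-fold recalls are averaged over several different splits, the estimate should depend less on any single random draw than one 20% holdout does. In RapidMiner terms, this roughly corresponds to placing the SMOTE operator in the training subprocess of Cross Validation rather than before it.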