How can I improve the performance of my model on an imbalanced dataset for a classification problem?

Hi,

This is my first time using RapidMiner. I have to do a classification for an assignment.
The dataset is really imbalanced. I have 180 out of 12,800 donors who donated in the past (class 1), and the remaining donors didn't donate (class 0).

When I created and selected relevant attributes, the class precisions were reasonable, but the class recall for class 1 was very poor: something close to 8%.

However, when I used the 'Sample' operator to balance my dataset, the class recall and the class precision were both around 60%. I am not sure if it is the right thing to do, because I end up with 360 donors instead of 12,800.

At the end, I have to use a test set of more than 12 000 donors to predict which donor will donate.

Thank you

NB: My kappa is equal to 0.267

Accepted answers

varunm1

Hello @Samiraaa_123

What's your kappa value? And which operators did you apply? Also, there is no guarantee that you will always get excellent results with ML: sometimes the data is close to random, or the hyperparameters are not tuned well. You need to keep trying different models and tuning their hyperparameters, and you also need to understand the data by checking correlations and distributions.
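To see why kappa and class recall are more informative than accuracy here, a minimal sketch in plain Python follows. The function name `binary_metrics` and the confusion-matrix counts are hypothetical; only the 180 out of 12,800 class split comes from the question. It computes precision and recall for class 1 and Cohen's kappa from the four cells of a binary confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (precision, recall, kappa) for class 1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Observed agreement is plain accuracy.
    observed = (tp + tn) / total
    # Agreement expected by chance, from predicted and actual class totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0
    return precision, recall, kappa


# Hypothetical model: finds only 15 of the 180 donors, with 5 false alarms.
precision, recall, kappa = binary_metrics(tp=15, fp=5, fn=165, tn=12615)
accuracy = (15 + 12615) / 12800

# A model that always predicts class 0: very high accuracy, but kappa is 0.
_, _, kappa_all_negative = binary_metrics(tp=0, fp=0, fn=180, tn=12620)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.3f} kappa={kappa:.3f}")
print(f"kappa of always-predict-0 model: {kappa_all_negative:.3f}")
```

With classes this skewed, accuracy stays near 99% even when almost every donor is missed, which is why kappa and the class 1 recall are the numbers to watch.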

All comments

varunm1

Hello @Samiraaa_123

How are you validating your model? Is it cross-validation or split validation?

Sampling is good when it is applied to the training set only. It is not recommended to apply sampling to the whole dataset. As the dataset is small, you can try upsampling your minority class with the SMOTE operator from the Operator Toolbox (download it from the Marketplace) instead of downsampling.

You can also try weighting your examples instead of sampling; this works only for a few algorithms, such as neural networks, decision trees, etc. Weighting doesn't alter your sample sizes but assigns equal importance to both classes. This can be done using Generate weights (Stratification). You should check whether the algorithm you are trying to use accepts this weighting: right-click the algorithm operator and then click Show operator info. If you see a green tick after "Weighted Examples", that algorithm supports weighting.
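As a rough illustration of what stratified weighting does, here is a minimal sketch in plain Python. It is not the RapidMiner operator itself: `stratification_weights` is a hypothetical helper, and the assumption is that the total weight is split evenly across classes and then evenly across the examples within each class.

```python
from collections import Counter


def stratification_weights(labels, total_weight=None):
    """Give every class the same total weight, shared equally among its examples."""
    counts = Counter(labels)
    if total_weight is None:
        total_weight = len(labels)  # assumption: weights sum to the number of examples
    per_class = total_weight / len(counts)
    return [per_class / counts[y] for y in labels]


# Class split from the question: 180 donors (1) and 12,620 non-donors (0).
labels = [1] * 180 + [0] * 12620
weights = stratification_weights(labels)

print("weight of a class 1 example:", round(weights[0], 3))
print("weight of a class 0 example:", round(weights[-1], 3))
```

The number of examples is unchanged; each donor simply counts for far more than each non-donor, so both classes carry the same total weight.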

Are you tuning the model's hyperparameters? Are you trying different algorithms?

Samira_123

Hi @varunm1

I used the 'Cross Validation' operator to validate my model. I had already tried to balance my dataset with Generate weights (Stratification), since I saw on the forum that this could work, but RapidMiner says that 'Random Forest' (the operator I am using for the classification) will disregard the weights.

Does the SMOTE operator need to be placed just before the cross-validation?

Thank you so much for your answer

varunm1

Yep, Random Forest doesn't accept weights. You should apply SMOTE or any other sampling operator inside the training part of the cross-validation. If you apply it to the whole dataset, it will bias your model, and the performance you measure won't carry over to new data that might come in the future. You can also use Optimize Parameters (Grid) to search for good hyperparameters (number of trees, maximal depth, etc.) for the random forest. Also try different models like gradient boosting, neural networks, SVM, etc.
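The idea of sampling only inside the training part of the cross-validation can be sketched in plain Python as below. This is only an illustration under stated assumptions: all function names are hypothetical, and simple random oversampling (duplicating minority examples) stands in for SMOTE, which would create synthetic examples instead. The point it shows is that each test fold keeps the original class ratio while only the training indices are balanced.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def oversample_minority(indices, labels, rng):
    """Randomly duplicate smaller classes until every class matches the largest."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def cv_splits_with_training_oversampling(labels, k=5, seed=0):
    """Yield (train_indices, test_indices); only the training side is oversampled."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield oversample_minority(train_idx, labels, rng), test_idx


# Class split from the question: 180 donors (1) and 12,620 non-donors (0).
labels = [1] * 180 + [0] * 12620
splits = list(cv_splits_with_training_oversampling(labels, k=5))

train_idx, test_idx = splits[0]
print("train class 1:", sum(labels[i] == 1 for i in train_idx),
      "train class 0:", sum(labels[i] == 0 for i in train_idx))
print("test class 1:", sum(labels[i] == 1 for i in test_idx),
      "test class 0:", sum(labels[i] == 0 for i in test_idx))
```

Because the test folds stay imbalanced, the recall and precision measured on them are closer to what you should expect on the real test set of more than 12,000 donors.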

Samira_123

Hi @varunm1,

Thank you for your advice

I've been trying to do what you said. My class precision is really good, but the class recall for class 1 is still very poor.

Samira_123

Hi @varunm1,

Thank you. You were very helpful.
Wish you a good weekend