How to deal with high cardinality variables on a Regression problem in RapidMiner

New Altair Community Member

Sep 20, 2019

Updated Nov 5, 2024 by Jocelyn

Hello, I'm working on a Regression problem with a dataset that looks like:

> str(myds)


'data.frame':   841500 obs. of  30 variables:
 $ score                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received       : int  0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received    : int  3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server      : int  17 17 7 7 7 14 10 7 34 10 ...
 $ age                       : int  17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned             : int  929 655 286 357 571 833 476 414 726 857 ...
 $ registrationYYYY          : Factor w/ 2 levels ...
 $ registrationDateMM        : Factor w/ 9 levels ...
 $ registrationDateDD        : Factor w/ 31 levels ...
 $ registrationDateHH        : Factor w/ 24 levels ...
 $ registrationDateWeekDay   : Factor w/ 7 levels ...
 $ catVar_05                 : Factor w/ 2 levels ...
 $ catVar_06                 : Factor w/ 140 levels ...
 $ catVar_07                 : Factor w/ 21 levels ...
 $ catVar_08                 : Factor w/ 1582 levels ...
 $ catVar_09                 : Factor w/ 70 levels ...
 $ catVar_10                 : Factor w/ 755 levels ...
 $ catVar_11                 : Factor w/ 23 levels ...
 $ catVar_12                 : Factor w/ 129 levels ...
 $ catVar_13                 : Factor w/ 15 levels ...
 $ city                      : Factor w/ 22750 levels ...
 $ state                     : Factor w/ 55 levels ...
 $ zip                       : Factor w/ 26659 levels ...
 $ catVar_17                 : Factor w/ 2 levels ...
 $ catVar_18                 : Factor w/ 2 levels ...
 $ catVar_19                 : Factor w/ 3 levels ...
 $ catVar_20                 : Factor w/ 6 levels ...
 $ catVar_21                 : Factor w/ 2 levels ...
 $ catVar_22                 : Factor w/ 4 levels ...
 $ catVar_23                 : Factor w/ 5 levels ...

My goal is to predict the target variable: "score".

I'm using R, but I also want to use RapidMiner. From what I have read so far, the two tools work well together.

At http://mod.rapidminer.com/#app I described the dataset shown above, and it recommended KNN for predicting the target variable "score".

My main concern is the high-cardinality variables: "city" and "zip".

One way to deal with them is "Target Encoding" (also known as "Mean Encoding"). But as stated here:

https://maxhalford.github.io/blog/target-encoding-done-the-right-way/

"The problem of target encoding has a name: over-fitting. Indeed relying on an average value isn’t always a good idea when the number of values used in the average is low. You’ve got to keep in mind that the dataset you’re training on is a sample of a larger set. This means that whatever artifacts you may find in the training set might not hold true when applied to another dataset (i.e. the test set)."

It looks like the way to handle that side effect is regularization.
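To make the idea concrete: one common form of regularization for target encoding is additive smoothing, where each level's mean is blended with the global mean and levels with few rows are pulled more strongly toward the global mean. Below is a minimal base R sketch of that idea. The function names, the smoothing weight `m`, and the toy data are all made up for illustration; this is not how vtreat or RapidMiner implement it.

```r
# Smoothed target encoding in base R (illustrative sketch only).
# m is a hypothetical smoothing weight: larger m pulls rare levels
# harder toward the global mean of the target.
fit_target_encoding <- function(x, y, m = 10) {
  x     <- as.character(x)
  prior <- mean(y)                 # global mean of the target
  n     <- tapply(y, x, length)    # number of rows per level
  mu    <- tapply(y, x, mean)      # raw mean of the target per level
  list(
    levels = names(n),
    values = as.numeric((n * mu + m * prior) / (n + m)),
    prior  = prior
  )
}

apply_target_encoding <- function(fit, x) {
  out <- fit$values[match(as.character(x), fit$levels)]
  out[is.na(out)] <- fit$prior     # unseen levels fall back to the global mean
  out
}

# Toy data, invented for this example
toy <- data.frame(
  city  = c("A", "A", "A", "A", "B"),
  score = c(1, 1, 1, 1, 0)
)

fit     <- fit_target_encoding(toy$city, toy$score, m = 2)
encoded <- apply_target_encoding(fit, c("A", "B", "C"))
print(encoded)
```

Here "B" has a single row with score 0, so its encoding is pulled well above 0 toward the global mean, and the unseen level "C" simply gets the global mean. In a real workflow the encoding would be fitted on training rows only (ideally out-of-fold), so that a row's own target value does not leak into its feature.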

I have been using R, where one of the most popular packages for this is "vtreat", which is used here:

https://www.r-bloggers.com/vtreat-prepare-data/

That package is certainly awesome, but I think it is going to take me a while to become familiar with it.

My question is: can RapidMiner do target encoding as well, with regularization applied at the same time? Its very intuitive UI would probably help.
