How to deal with high cardinality variables on a Regression problem in RapidMiner

User: "tlg265"
New Altair Community Member
Updated by Jocelyn
Hello, I'm working on a Regression problem with a dataset that looks like:
> str(myds)
'data.frame': 841500 obs. of 30 variables:
 $ score                   : num 0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received     : int 0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received  : int 3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server    : int 17 17 7 7 7 14 10 7 34 10 ...
 $ age                     : int 17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned           : int 929 655 286 357 571 833 476 414 726 857 ...
 $ registrationYYYY        : Factor w/ 2 levels ...
 $ registrationDateMM      : Factor w/ 9 levels ...
 $ registrationDateDD      : Factor w/ 31 levels ...
 $ registrationDateHH      : Factor w/ 24 levels ...
 $ registrationDateWeekDay : Factor w/ 7 levels ...
 $ catVar_05               : Factor w/ 2 levels ...
 $ catVar_06               : Factor w/ 140 levels ...
 $ catVar_07               : Factor w/ 21 levels ...
 $ catVar_08               : Factor w/ 1582 levels ...
 $ catVar_09               : Factor w/ 70 levels ...
 $ catVar_10               : Factor w/ 755 levels ...
 $ catVar_11               : Factor w/ 23 levels ...
 $ catVar_12               : Factor w/ 129 levels ...
 $ catVar_13               : Factor w/ 15 levels ...
 $ city                    : Factor w/ 22750 levels ...
 $ state                   : Factor w/ 55 levels ...
 $ zip                     : Factor w/ 26659 levels ...
 $ catVar_17               : Factor w/ 2 levels ...
 $ catVar_18               : Factor w/ 2 levels ...
 $ catVar_19               : Factor w/ 3 levels ...
 $ catVar_20               : Factor w/ 6 levels ...
 $ catVar_21               : Factor w/ 2 levels ...
 $ catVar_22               : Factor w/ 4 levels ...
 $ catVar_23               : Factor w/ 5 levels ...
My goal is to predict the target variable: "score".

I'm using R, but I also want to use RapidMiner. From what I have read so far, the two tools work well together.

At http://mod.rapidminer.com/#app I described the dataset shown above, and it recommended k-NN for predicting the target variable "score".

My main concern is the high-cardinality variables: { "city", "zip" }.

One way to deal with them is "Target Encoding" (also known as "Mean Encoding"). But as stated here:

https://maxhalford.github.io/blog/target-encoding-done-the-right-way/

"The problem of target encoding has a name: over-fitting. Indeed relying on an average value isn’t always a good idea when the number of values used in the average is low. You’ve got to keep in mind that the dataset you’re training on is a sample of a larger set. This means that whatever artifacts you may find in the training set might not hold true when applied to another dataset (i.e. the test set)."
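To make the quoted concern concrete, here is a minimal sketch of plain (unregularized) target encoding in pure Python. The toy rows and the function name `naive_target_encode` are made up for illustration; they are not taken from the dataset in this post.

```python
from collections import defaultdict

def naive_target_encode(categories, targets):
    """Map each category to the plain mean of its targets (no regularization)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

# Hypothetical toy data: one well-populated city and one seen only once.
cities = ["Springfield"] * 4 + ["Tinytown"]
scores = [0.0, 1.0, 0.0, 1.0, 1.0]

encoding = naive_target_encode(cities, scores)
print(encoding)  # {'Springfield': 0.5, 'Tinytown': 1.0}
```

"Tinytown" is encoded as exactly the score of its single row, so the encoded feature simply repeats the target for that row. With 22750 "city" levels across 841500 rows, the average level has only about 37 rows, so some levels may well be this sparse.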

It looks like the way to handle that side effect is "Regularization".
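One common form of regularization for target encoding is additive smoothing, which pulls each category mean toward the global mean in proportion to how few rows the category has. The sketch below is only an illustration in pure Python, not RapidMiner's or vtreat's implementation; the toy data, the function names, and the smoothing weight `m` are assumptions chosen for the example.

```python
from collections import defaultdict

def smoothed_target_encode(categories, targets, m=10.0):
    """Blend each category mean with the global mean.

    m is a hypothetical smoothing weight: a category needs about m rows
    before its own mean counts as much as the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # (n * category_mean + m * global_mean) / (n + m),
    # where n * category_mean is just the per-category sum.
    mapping = {
        cat: (sums[cat] + m * global_mean) / (counts[cat] + m)
        for cat in sums
    }
    return mapping, global_mean

# Hypothetical toy data: one well-populated city and one seen only once.
cities = ["Springfield"] * 4 + ["Tinytown"]
scores = [0.0, 1.0, 0.0, 1.0, 1.0]

mapping, global_mean = smoothed_target_encode(cities, scores, m=2.0)

def encode(city):
    """Look up a city; levels never seen in training get the global mean."""
    return mapping.get(city, global_mean)

print(round(encode("Tinytown"), 3))  # pulled from 1.0 toward the global mean 0.6
print(encode("Nowhere"))             # unseen level falls back to the global mean
```

The mapping should be fitted on the training rows only and then applied to the test rows; otherwise the test targets leak into the encoded feature.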

I have been using R, and one of the most popular packages for this is "vtreat", which is used here:

https://www.r-bloggers.com/vtreat-prepare-data/

That package is certainly awesome, but I think it is going to take me a while to become familiar with it.

My question is: can RapidMiner do "Target Encoding" as well, and apply "Regularization" at the same time? Its very intuitive UI would probably help.
