Convert categorical variables into dummy variables

aisyahwahyuna
aisyahwahyuna New Altair Community Member
edited November 5 in Community Q&A
Hi, I want to perform a regression task to predict a continuous response. I have 4 categorical variables; the others are numerical.

Categorical variables are:
age=(≤20, 21-35, 36-50, ≥51)
gender=(Female, Male)
income level=(1=insufficient, 2=sufficient)
BMI range=(1=<25, 2=>25)
*Income level & BMI range are keyed in as numerical codes in my dataset

Let's say I want to use SVM, RF, Decision Tree, MLR, and KNN:

1. Should I convert all categorical variables into dummy variables? 
2. If using numerical coding is more suitable, should I change the data type to nominal (binominal/polynominal) or retain it as integer?

Answers

  • Hi @aisyahwahyuna, unfortunately this is a case of "it depends on which model you're using"! Some models can handle categorical variables directly, either because of the way they're formulated or because they convert them internally (e.g. Decision Tree and GLM, respectively). Any operator that can't will usually show you an error that reads something like this:

    Where you do want to use a model that can't handle categorical variables, I'd personally be very careful about using numerical coding and would recommend dummy encoding as the preferred method - here the Nominal to Numerical operator should work well. Numerical coding can be appropriate in some instances, especially when the variable is binominal, but I use it sparingly because the model treats the codes as ordered, evenly spaced quantities, which can bias its output. Hope this helps!
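
For anyone who wants to see what dummy encoding produces outside the operator, here is a rough sketch in Python with pandas. It is only an illustration: the column names and the four toy rows are made up, `height_cm` is a hypothetical numerical predictor, and `pd.get_dummies` stands in for the Nominal to Numerical operator rather than reproducing its exact behaviour.

```python
import pandas as pd

# Toy rows invented purely for illustration; column names are hypothetical.
df = pd.DataFrame({
    "age": ["≤20", "21-35", "36-50", "≥51"],
    "gender": ["Female", "Male", "Male", "Female"],
    "income_level": [1, 2, 2, 1],   # 1 = insufficient, 2 = sufficient
    "bmi_range": [1, 1, 2, 2],      # 1 = <25, 2 = >25
    "height_cm": [158.0, 171.5, 180.2, 165.3],  # stays numerical
})

categorical = ["age", "gender", "income_level", "bmi_range"]

# Listing the columns explicitly makes pandas treat the integer-coded
# variables as categories too, instead of leaving them as quantities.
# drop_first=True drops one reference level per variable, which avoids
# perfectly collinear columns in a linear regression with an intercept.
encoded = pd.get_dummies(df, columns=categorical, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Each categorical variable with k levels becomes k-1 columns of 0/1 values here, so the four-level age variable turns into three columns and each two-level variable into one. For distance-based models such as KNN you may prefer to keep all k columns (`drop_first=False`); that is a judgment call rather than a rule.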