"K-Means Clustering with Mixed Attributes"

Question

Hello Everyone,
I want to segment my customer base (13,000 customers) according to several attributes such as:

1. Total Deposits (numerical)
2. Total #Accounts (integer)
3. #Months Since Customer Acquisition (integer)
4. Has the client subscribed for Online Banking or not? (categorical)

I want to see what is common among my customers by splitting them into clusters.

I have mixed attributes in my data set (numerical and categorical).

The questions I have are:
1. What is the best distance measure in this case?
2. Do I need to transform any attribute?
3. Do I need to normalize any attribute?
4. What is the best way to set up the model?

Any help would be appreciated.

Thank You

tuts · Answer

Hi Martin,

What would be the approach if there are more categorical variables, especially of nominal type?

For example, suppose khannadh's list of attributes included more nominals, such as 5. Location of the customer and 6. Preferred marketing material.

What is the recommended approach for clustering in RM?

I use Gower distance for mixed datatypes in R.
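For anyone who has not met it, here is a minimal plain-Python sketch of the idea behind Gower distance (not the R implementation and not a RapidMiner operator). Numeric attributes contribute their absolute difference divided by the attribute's range, nominal attributes contribute 0 if equal and 1 otherwise, and the result is the average. The function name, the attribute names, the customer records and the ranges are all made up for illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records (dicts).

    numeric_ranges maps each numeric attribute to its (min, max) over the
    data set; every attribute not listed there is treated as nominal.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            low, high = numeric_ranges[name]
            span = high - low
            # Range-scaled absolute difference, so each attribute lies in [0, 1].
            total += abs(a[name] - b[name]) / span if span else 0.0
        else:
            # Nominal attribute: simple mismatch.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


# Two made-up customers and assumed attribute ranges.
c1 = {"deposits": 1000.0, "accounts": 1, "months": 12,
      "online": "yes", "location": "north"}
c2 = {"deposits": 6000.0, "accounts": 3, "months": 12,
      "online": "no", "location": "north"}
ranges = {"deposits": (0.0, 10000.0), "accounts": (1, 5), "months": (0, 120)}

d = gower_distance(c1, c2, ranges)
print(d)
```

Because every attribute is scaled to the interval from 0 to 1 before averaging, no separate normalization step is needed for this measure.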

Thanks!

MartinLiebig · Answer

Hi again,

All normalization methods can only be applied to numerical data, so in general I would do the transformation first.

I would use Nominal to Numerical with dummy coding. For a two-valued attribute, be sure to exclude the second newly created attribute using Select Attributes, since it carries the same information as the first.
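To show what dummy coding does to a yes/no attribute, here is a small plain-Python sketch (not RapidMiner itself). The helper `dummy_code`, the column naming scheme and the sample values are assumptions made for illustration only.

```python
def dummy_code(name, values):
    """Return one 0/1 column per distinct category, keyed 'name = category'."""
    columns = {}
    for category in sorted(set(values)):
        columns[f"{name} = {category}"] = [
            1 if v == category else 0 for v in values
        ]
    return columns


# Made-up values for the online banking attribute.
online = ["yes", "no", "no", "yes", "no"]
coded = dummy_code("online_banking", online)

# With two categories the columns always sum to 1 row by row,
# so one of them adds no information.
redundant = all(a + b == 1 for a, b in zip(*coded.values()))

# Keep only the first created column and drop the second.
kept_name = next(iter(coded))
kept = coded[kept_name]
print(kept_name, kept, redundant)
```

Keeping both columns would count the same yes/no difference twice in a Euclidean distance, which is one reason to drop one of them.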

For normalizing I would use either Z-transformation or range transformation in the Normalize operator.
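As a rough illustration of the two options, here is a plain-Python sketch (not the Normalize operator). It assumes the z-transformation divides by the population standard deviation and that the range transformation maps onto 0 to 1 by default; the function names and the deposit values are made up.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    # Note: a constant column (sigma == 0) would need special handling.
    return [(v - mu) / sigma for v in values]


def range_transform(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so the smallest value maps to new_min, the largest to new_max."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)
    return [new_min + (v - low) * scale for v in values]


# Made-up total deposits for five customers.
deposits = [100.0, 200.0, 300.0, 400.0, 500.0]
z = z_transform(deposits)
ranged = range_transform(deposits)
print(z)
print(ranged)
```

Either way, attributes measured on very different scales (deposits versus number of accounts) end up comparable, so no single attribute dominates the distance calculation.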

Judging how good a model is is really tricky. That is maybe the biggest problem in unsupervised learning. You can either check whether the clusters make sense (maybe using the label approach I mentioned earlier) or look at the performance values generated by the clustering performance operators. Those values usually have the problem that a larger k results in better values. As I said, this is a really tricky problem.
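To illustrate why such values tend to improve as k grows, here is a minimal plain-Python sketch (not RapidMiner's performance operators). It computes the smallest achievable within-cluster sum of squares for a tiny one-dimensional data set by trying every contiguous split of the sorted values, which is enough for one-dimensional data. The helper names and the deposit values are made up for illustration.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the group mean."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values)


def best_wcss(values, k):
    """Smallest within-cluster sum of squares over all splits of the
    sorted values into k contiguous groups (brute force, tiny data only)."""
    data = sorted(values)
    n = len(data)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        total = sum(sse(data[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best


# Made-up deposit values (in thousands) with three visible groups.
deposits = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.9, 10.2]
wcss_by_k = {k: best_wcss(deposits, k) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 4))
```

The value keeps shrinking with every extra cluster and reaches zero when each point is its own cluster, so this kind of measure alone cannot tell you which k to pick.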

Do you use RM 6.3? Then wisdom of the crowds might really help you in choosing the parameters.

Cheers,
Martin

khannadh · Answer

Hi Martin,

Thank you for your reply.

As you might have guessed, I am new to RapidMiner.
It would be great if you could elaborate on the steps you would take first.

Of course, bringing in the data would be the first one.
But I am having problems figuring out the sequence of the other steps.

Q. Do you transform first and then normalize, or vice versa?
Q. How do you transform, what do you transform, and into what? The same questions apply to normalizing.
Q. How do you determine how good your model is and if you've clustered your data well?

It would be great if you could let me know.
I appreciate your help.

Thank You