"K-Means Clustering with Mixed Attributes"

Hello Everyone,
I want to segment my customer base (13,000 customers) according to several attributes such as:
1. Total Deposits (numerical)
2. Total #Accounts (integer)
3. #Months Since Customer Acquisition (integer)
4. Has the client subscribed to Online Banking or not? (categorical)
I want to see what is common among my customers by splitting them into clusters.
I have mixed attributes in my data set (numerical and categorical).
The questions I have are:
1. What is the best distance measure in this case?
2. Do I need to transform any attribute?
3. Do I need to normalize any attribute?
4. What is the best way to set up the model?
Any help would be appreciated.
Thank You
1. What is the best distance measure in this case?
There is no "best"; you simply need to try. I would recommend trying Euclidean distance, Manhattan distance and cosine similarity.
2. Do I need to transform any attribute?
Probably yes. I would use Nominal to Numerical to turn the categorical attribute into 0/1 variables. Otherwise you can only use MixedEuclideanDistance.
3. Do I need to normalize any attribute?
Almost certainly. Otherwise you would introduce an implicit (and most likely unwanted) weighting of your attributes.
4. What is the best way to set up the model?
I'm not sure what you mean by this.
A general tip: You can analyse your clusters by using the cluster assignment as the label and applying feature selection techniques (e.g. Weight by Gini Index or Forward Selection). That gives you the most important attributes distinguishing the clusters.
If you use a one-vs-all strategy you can answer the question "What distinguishes cluster1 from the others?", which might be really helpful for interpreting results.
Second tip: If you use any distance other than Euclidean distance, you should not use k-Means but k-Medoids. Otherwise the algorithm might not converge.
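To illustrate the second tip, here is a minimal Python sketch of the k-Medoids idea with Manhattan distance. It is not a RapidMiner process; the function names, the naive initialisation and the toy points are all assumptions made for illustration. Because each cluster centre is always an actual data point, any distance function can be plugged in.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(points, k, dist=manhattan, max_iter=100):
    """Alternate between assigning points and re-picking each medoid."""
    medoids = list(range(k))  # naive init (assumption): indices of the first k points
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to all members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[j]) for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    # Label each point with the position of its nearest medoid.
    labels = [
        min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels


# Hypothetical, already normalized toy data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
medoids, labels = k_medoids(points, k=2)
```

The update step only compares distances between existing points, which is why it does not rely on the mean being a meaningful centre.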
Martin
Hi Martin,
Thank you for your reply.
As you might have guessed, I am new to RapidMiner.
It would be great if you could elaborate some more in terms of steps you would take first.
Of course, bringing in the data would be the first one.
But I am having problems in figuring out the sequence of other steps.
Q. Do you transform first and then normalize, or vice versa?
Q. How do you transform, what do you transform, and into what? The same questions apply to normalizing.
Q. How do you determine how good your model is and if you've clustered your data well?
It would be great if you could let me know.
I appreciate your help.
Thank You
Hi again,
All normalization methods are applicable only to numerical data, so in general I would do the transformation first.
I would use Nominal to Numerical with dummy coding. Be sure to exclude the second newly created attribute using Select Attributes; for a two-valued attribute it is redundant.
For normalizing, I would use either Z-transformation or range transformation in the Normalize operator.
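Outside RapidMiner, the same order (transform first, then normalize) can be sketched in plain Python. The customer rows and helper names below are made up for illustration, and the sketch produces a single 0/1 column directly instead of creating two dummy columns and dropping one.

```python
from statistics import mean, pstdev

# Hypothetical rows: (total_deposits, n_accounts, months_as_customer, online_banking)
customers = [
    (12000.0, 2, 36, "yes"),
    (500.0, 1, 3, "no"),
    (87000.0, 5, 120, "yes"),
    (2300.0, 1, 14, "no"),
]


def dummy_code(value, positive="yes"):
    """Encode a two-level nominal attribute as a single 0/1 column."""
    return 1.0 if value == positive else 0.0


def z_transform(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma if sigma else 0.0 for x in column]


# Step 1: transform nominal values to numbers.
numeric_rows = [(d, float(a), float(m), dummy_code(o)) for d, a, m, o in customers]

# Step 2: normalize every column so that no attribute dominates the distance.
columns = [z_transform(list(col)) for col in zip(*numeric_rows)]
prepared = [tuple(row) for row in zip(*columns)]
```

Without step 2, Total Deposits (values in the thousands) would outweigh the 0/1 column by several orders of magnitude in any distance calculation. Whether the 0/1 column itself should also be rescaled is a judgment call; here it is treated like every other column.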
Judging how good a model is, is really tricky; that is maybe the biggest problem in unsupervised learning. You can either check whether the clusters make sense (maybe using the label approach I mentioned earlier) or look at the performance values generated by the clustering performance operators. Those performance values usually have the problem that a larger k yields better values. As I said, this is a really tricky problem.
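The effect of k on such a performance value can be seen in a small Python sketch. It uses the within-cluster sum of squared distances as a stand-in for what a clustering performance operator might report; the bare-bones k-means, its deterministic initialisation and the toy points are assumptions for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Bare-bones Lloyd iteration with a naive deterministic start."""
    # Assumption: starting centers are spread evenly over the input order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


def wcss(centers, groups):
    """Within-cluster sum of squared distances: lower looks 'better'."""
    return sum(sq_dist(p, c) for c, g in zip(centers, groups) for p in g)


# Hypothetical toy data with three visible groups of three points each.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0),
          (1.0, 8.0), (2.0, 9.0), (1.0, 9.0)]

scores = {k: wcss(*k_means(points, k)) for k in (1, 2, 3, len(points))}
```

On this data the score falls at every step from k=1 to k=3, and it reaches zero once every point is its own cluster, so the raw value alone cannot tell you which k is right. Looking for the k after which the improvement flattens out is one common heuristic.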
Do you use RM 6.3? Then the Wisdom of Crowds feature might really help you in choosing the parameters.
Martin
Hi Martin,
What would be the approach if there are more categorical variables, especially of nominal type?
For example, suppose khannadh's list of attributes also included nominals such as 5. Location of the customer and 6. Preferred marketing material.
What is the recommended approach for clustering in RM in that case?
I use Gower distance for mixed datatypes in R.
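For readers who have not met it, here is a minimal Python sketch of the basic Gower idea: numeric attributes contribute their absolute difference divided by the column range, nominal attributes contribute 0 if equal and 1 otherwise, and the result is the average. The rows, attribute values and function name are hypothetical, and weights and missing-value handling are left out.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between two records.

    kinds[i] is "num" or "cat"; ranges[i] is max - min of a numeric
    column over the whole data set (unused for nominal attributes).
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range scaling keeps each numeric contribution in [0, 1].
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Simple matching for nominal values.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical rows: deposits, accounts, months, online banking,
# location, preferred marketing material.
rows = [
    (12000.0, 2, 36, "yes", "north", "email"),
    (500.0, 1, 3, "no", "south", "post"),
    (87000.0, 5, 120, "yes", "north", "post"),
]
kinds = ["num", "num", "num", "cat", "cat", "cat"]
ranges = [
    max(col) - min(col) if kind == "num" else None
    for col, kind in zip(zip(*rows), kinds)
]

d01 = gower_distance(rows[0], rows[1], kinds, ranges)
d02 = gower_distance(rows[0], rows[2], kinds, ranges)
```

A pairwise matrix of such distances could then be handed to a medoid-based method, since that only needs distances between existing records. This thread does not say whether RapidMiner offers an equivalent measure out of the box.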