"K-Means Clustering with Mixed Attributes"

khannadh New Altair Community Member
edited November 5 in Community Q&A
Hello Everyone,
I want to segment my customer base (13,000 customers) according to several attributes such as:

1. Total Deposits (numerical)
2. Total #Accounts (integer)
3. #Months Since Customer Acquisition (Integer)
4. Has the client subscribed for Online Banking or not? (Categorical)

I want to see what is common among my customers by splitting them into clusters.

I have mixed attributes in my data set (numerical and categorical).

The questions I have are:
1. What is the best distance measure in this case?
2. Do I need to transform any attribute?
3. Do I need to normalize any attribute?
4. What is the best way to set up the model?

Any help would be appreciated.

Thank You

Answers

  • MartinLiebig
    Altair Employee
    Hi!

    1. What is the best distance measure in this case?
    There is no "best". You simply need to try several. I would recommend trying Euclidean distance, Manhattan distance, and cosine similarity.

    2. Do I need to transform any attribute?
    Probably yes. I would use Nominal to Numerical to turn the categorical attribute into 0/1 variables. Otherwise you can only use MixedEuclideanDistance.

    3. Do I need to normalize any attribute?
    Almost certainly. Otherwise you would introduce an implicit (and most likely unwanted) weighting of your attributes.

    4. What is the best way to set up the model?
    I'm not sure what you mean by this.
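
    To illustrate the implicit weighting mentioned under question 3, here is a minimal Python sketch (not a RapidMiner process). The three customers and their values are made up for illustration, and the z-transformation is written by hand rather than taken from any library.

```python
import math
from statistics import mean, pstdev

# Hypothetical customers: (total deposits, number of accounts, months since acquisition)
customers = [
    (250_000.0, 1, 12),
    (251_000.0, 8, 120),
    (300_000.0, 1, 12),
]

def euclidean(a, b):
    """Plain Euclidean distance between two equally long tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def z_transform(rows):
    """Scale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# Raw values: the deposits column dominates because its numbers are largest.
raw_d01 = euclidean(customers[0], customers[1])
raw_d02 = euclidean(customers[0], customers[2])

# After scaling, every attribute contributes on a comparable scale.
scaled = z_transform(customers)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(raw_d01 < raw_d02)        # deposits decide who is "near" on raw data
print(scaled_d02 < scaled_d01)  # accounts and tenure now matter as well
```

    On the raw data the first customer looks closest to the second one, purely because their deposits are similar. After the z-transformation the first customer is closer to the third, who has the same number of accounts and the same tenure.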

    A general tip: You can analyse your clusters by treating the cluster assignment as the label and using feature selection techniques (e.g. Weight by Gini Index or Forward Selection). That gives you the most important attributes distinguishing the clusters.
    If you use a one-vs-all strategy, you can answer the question "What distinguishes cluster 1 from the others?", which can be really helpful for interpreting the results.

    Second tip: If you use any distance other than Euclidean distance, you should not use k-Means but k-Medoids. Otherwise the algorithm might not converge.
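
    The reasoning behind this tip: the mean used by k-Means minimises squared Euclidean distance, but not other distances in general, whereas k-Medoids always picks an actual data point as the cluster centre. Below is a rough Python sketch of the simple alternating variant of k-Medoids with Manhattan distance. It is not RapidMiner's implementation; the data points and the naive "first k points" initialisation are assumptions made for illustration.

```python
def manhattan(a, b):
    """Manhattan (city block) distance between two equally long tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def k_medoids(points, k, distance, max_iter=100):
    """Alternating k-medoids: assign each point to its nearest medoid, then
    make the member with the smallest total distance the new medoid."""
    medoids = list(range(k))  # naive initialisation (assumption): first k points
    for _ in range(max_iter):
        # Assignment step: group point indices by nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: distance(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the best representative inside each cluster.
        new_medoids = []
        for m, members in clusters.items():
            members = members or [m]  # keep the old medoid if a cluster is empty
            best = min(
                members,
                key=lambda c: sum(distance(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    # Final labels: index of the nearest medoid for every point.
    labels = [min(medoids, key=lambda m: distance(p, points[m])) for p in points]
    return medoids, labels

# Made-up, already normalised-looking data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2, distance=manhattan)
print(medoids, labels)
```

    Because the distance is passed in as a function, the same sketch works with any other dissimilarity measure.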

    Cheers,

    Martin
  • khannadh
    khannadh New Altair Community Member
    Hi Martin,

    Thank you for your reply.

    As you might have guessed, I am new to RapidMiner.
    It would be great if you could elaborate some more in terms of steps you would take first.

    Of course, bringing in the data would be the first one.
    But I am having problems in figuring out the sequence of other steps.

    Q. Do you transform first and then normalize, or vice versa?
    Q. How do you transform: which attributes, and into what? The same question applies to normalizing.
    Q. How do you determine how good your model is and if you've clustered your data well?

    It would be great if you could let me know.
    I appreciate your help.

    Thank You
  • MartinLiebig
    Altair Employee
    Hi again,

    All normalization methods can only be applied to numerical data, so in general I would transform first.

    I would use Nominal to Numerical with dummy coding. Be sure to exclude the second newly created attribute using Select Attributes; for a two-valued attribute such as Online Banking, it is redundant with the first.

    For normalizing, I would use either Z-Transformation or range transformation in the Normalize operator.
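
    The order described above (transform first, then normalize) can be sketched in plain Python as follows. This is only an illustration of the idea, not what the RapidMiner operators do internally; the rows, the helper names `dummy_code` and `range_transform`, and the category order are all assumptions.

```python
# Hypothetical rows: (total deposits, accounts, months as customer, online banking)
rows = [
    (1200.0, 1, 6, "yes"),
    (56000.0, 3, 48, "no"),
    (8300.0, 2, 120, "yes"),
]

def dummy_code(value, categories):
    """One 0/1 column per category except the last, which is implied by the rest."""
    return [1.0 if value == c else 0.0 for c in categories[:-1]]

def range_transform(column):
    """Rescale a numeric column to [0, 1] (min-max normalisation)."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # guard against constant columns
    return [(v - lo) / span for v in column]

# Step 1: transform the categorical attribute into a single 0/1 column.
encoded = [
    [float(dep), float(acc), float(months)] + dummy_code(online, ["yes", "no"])
    for dep, acc, months, online in rows
]

# Step 2: normalize every (now numeric) column.
columns = [range_transform(list(col)) for col in zip(*encoded)]
prepared = [list(row) for row in zip(*columns)]

for row in prepared:
    print(row)
```

    After both steps every attribute lies between 0 and 1, so none of them dominates the distance calculation simply because of its unit.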

    Judging how good a clustering model is really is tricky; it may be the biggest problem in unsupervised learning. You can either check whether the clusters make sense (perhaps using the label approach I mentioned earlier) or look at the performance values produced by the clustering performance operators. The problem with those measures is that a larger k usually yields better values. As I said, this is a really tricky problem.
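
    A small Python sketch of why a larger k tends to look better. It uses the within-cluster sum of squares as a stand-in measure (an assumption; the RapidMiner performance operators may report something different), and the values and partitions are made up.

```python
from statistics import mean

def within_cluster_ss(clusters):
    """Sum of squared distances of every value to its own cluster mean."""
    total = 0.0
    for members in clusters:
        centre = mean(members)
        total += sum((v - centre) ** 2 for v in members)
    return total

# Made-up one-dimensional data with three obvious groups.
values = [1, 2, 3, 11, 12, 13, 21, 22, 23]

# Hand-picked partitions for k = 1, 3 and 9 clusters.
partitions = {
    1: [values],
    3: [values[0:3], values[3:6], values[6:9]],
    9: [[v] for v in values],
}

scores = {k: within_cluster_ss(p) for k, p in partitions.items()}
print(scores)  # {1: 606.0, 3: 6.0, 9: 0.0}
```

    With k = 9 every customer is its own cluster and the score is a "perfect" 0, although that clustering says nothing useful. This is why such a measure cannot be used on its own to choose k.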

    Are you using RapidMiner 6.3? If so, the Wisdom of Crowds feature might really help you choose the parameters.

    Cheers,
    Martin
  • tuts
    tuts New Altair Community Member

    Hi Martin,

     

    What would be the approach if there are more categorical variables, especially of nominal type?

    Suppose khannadh's list of attributes contained more nominal attributes, such as 5. Location of the customer and 6. Preferred marketing material.

    What is the recommended approach for clustering in RM? 

    I use Gower distance for mixed datatypes in R.

     

    Thanks!
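
    For readers unfamiliar with the Gower distance mentioned above, here is a minimal Python sketch of the usual definition: numeric attributes contribute their absolute difference divided by the attribute's range, nominal attributes contribute 0 if equal and 1 otherwise, and the result is the average over all attributes. The customers, city names, and the `ranges` argument are hypothetical, and this says nothing about whether RapidMiner offers such a measure.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records.

    ranges[i] is the value range (max - min) of numeric attribute i,
    or None if attribute i is nominal.
    """
    parts = []
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Nominal attribute: simple mismatch.
            parts.append(0.0 if x == y else 1.0)
        else:
            # Numeric attribute: range-scaled absolute difference.
            parts.append(abs(x - y) / r if r else 0.0)
    return sum(parts) / len(parts)

# Hypothetical customers: (total deposits, accounts, location, preferred material)
a = (1000.0, 1, "Toronto", "email")
b = (5000.0, 3, "Toronto", "mail")
c = (9000.0, 5, "Ottawa", "mail")

# Ranges of the two numeric attributes over the data set; None marks nominals.
ranges = [8000.0, 4, None, None]

print(gower_distance(a, b, ranges))  # 0.5
print(gower_distance(a, c, ranges))  # 1.0
```

    Since this is not Euclidean distance, Martin's earlier tip applies: it would pair with k-Medoids rather than k-Means.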