
Should you normalize dummy coded variables in clustering?

User: "Curious"
New Altair Community Member
Can you keep them as dummies and only normalize numeric variables?

    User: "Telcontar120"
    New Altair Community Member
    Accepted Answer
    The distance calculations are going to be biased if your attributes are in dramatically different ranges. So, as @IngoRM says, the best solution would be to normalize all attributes into the same range (i.e., just use range normalization on the interval 0-1 for your numerics). If you don't have extreme outliers, that will work fine.
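
For anyone who wants to see what that looks like in code, here is a minimal Python/scikit-learn sketch of the idea, not the RapidMiner operators discussed in the thread: range-normalize only the numeric attributes into 0-1 and leave the 0/1 dummies untouched. Column names and data are made up.

```python
# Hypothetical sketch: range-normalize only the numeric columns to [0, 1]
# and leave the 0/1 dummy columns as-is. Column names are illustrative.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "income":   [25_000, 48_000, 120_000, 67_000],   # numeric
    "age":      [23, 35, 58, 41],                     # numeric
    "is_urban": [1, 0, 1, 1],                         # dummy
    "is_owner": [0, 1, 1, 0],                         # dummy
})

numeric_cols = ["income", "age"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Numerics are now in [0, 1]; dummies are already in {0, 1},
# so every attribute contributes on a comparable scale to the distance.
print(df)
```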

    However, I ordinarily wouldn't recommend normalizing dummy variables with the z-score method, because z-scores are not well suited to strictly two-valued distributions (which is what a dummy variable is by definition).
    If you have already applied z-score normalization to your numerical attributes and you also have dummy variables, then as long as you don't have any massive outliers, you can simply rescale those z-scores into the 0-1 range and the result should be fine.
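
A quick sketch of that "rescale the z-scores into 0-1" step, again in Python with scikit-learn scalers standing in for whatever normalization you already ran; the data is illustrative and assumes no extreme outliers.

```python
# Sketch of the double normalization idea: if numerics were already z-scored,
# squeeze those z-scores into [0, 1] so they line up with the 0/1 dummies.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_numeric = np.array([[25_000, 23],
                      [48_000, 35],
                      [120_000, 58],
                      [67_000, 41]], dtype=float)

z = StandardScaler().fit_transform(X_numeric)   # the existing z-scores
z_01 = MinMaxScaler().fit_transform(z)          # rescale those z-scores into [0, 1]

print(z_01.min(axis=0), z_01.max(axis=0))       # each column now spans exactly 0..1
```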

    But even leaving the z-scores as they are shouldn't be too bad (since they typically fall in the range -3 to 3), and it is certainly better than no normalization of the numerical attributes at all. You can test this yourself by applying different types of normalization and seeing the effect on the resulting clusters; in my experience, there is not usually a major difference in these cases.
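
One way to run that comparison yourself, sketched in Python on synthetic, made-up data: cluster the same attributes under two different normalizations and measure how much the two labelings agree. The adjusted Rand index is my choice of comparison metric here, not something prescribed in the thread.

```python
# Cluster the same data under two normalizations and compare the labelings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
numeric = rng.normal(loc=[50, 100], scale=[10, 40], size=(200, 2))  # two numeric attributes
dummies = rng.integers(0, 2, size=(200, 2)).astype(float)           # two dummy attributes

def cluster(numeric_block):
    X = np.hstack([numeric_block, dummies])
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

labels_range = cluster(MinMaxScaler().fit_transform(numeric))    # numerics in [0, 1]
labels_z     = cluster(StandardScaler().fit_transform(numeric))  # numerics as z-scores

# A score close to 1.0 means the two normalizations produce essentially the same clusters.
print(adjusted_rand_score(labels_range, labels_z))
```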

    If you do have significant outliers, you might consider reviewing them carefully before trying to do the clustering because they are going to be problematic no matter which approach to normalization you choose.
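
If you want a quick screen for such outliers before clustering, one common rule of thumb is to flag values far outside the interquartile range; the 1.5 multiplier below is a convention, not something prescribed in the thread, and the data is made up.

```python
# Hedged sketch of an IQR-based outlier screen on a numeric attribute.
import numpy as np

def iqr_outlier_mask(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

income = np.array([25_000, 48_000, 120_000, 67_000, 1_500_000])  # last value is an outlier
print(iqr_outlier_mask(income))  # [False False False False  True]
```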