Should you normalize dummy coded variables in clustering?

Curious
New Altair Community Member
Best Answer
-
The distance calculations are going to be biased if your attributes are in dramatically different ranges. So, as @IngoRM says, the best solution would be to normalize all attributes into the same range (e.g., range normalization onto the interval 0-1 for your numerics). If you don't have extreme outliers, that would be fine.
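A minimal sketch of the range (min-max) normalization described above, using plain numpy; the toy data and column meanings are hypothetical. Note that a 0/1 dummy column already spans [0, 1], so min-max scaling leaves it untouched while the numeric column is brought onto the same scale:

```python
import numpy as np

def range_normalize(X):
    """Scale each column of X onto the interval [0, 1] (min-max normalization)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero for constant columns
    return (X - col_min) / span

# Toy data: one numeric attribute (e.g. income) and one 0/1 dummy attribute.
X = np.array([[20000, 0],
              [50000, 1],
              [80000, 0]])
X_norm = range_normalize(X)
# The numeric column is mapped onto [0, 1]; the dummy column is unchanged,
# so both now contribute comparably to a Euclidean distance.
```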
However, I ordinarily wouldn't recommend applying z-score normalization to dummy variables, because the z-score method is not well suited to strictly two-valued distributions (which is what a dummy variable has by definition).
If you have already applied z-score normalization to your numerical attributes and you also have dummy variables, then as long as you don't have any massive outliers you can simply rescale the z-scores into the 0-1 range as well, and that should also be fine.
But even leaving the z-scores as they are shouldn't be too bad (since they typically fall in the range -3 to 3), and it is certainly better than no normalization of the numerics at all. You can actually test this yourself by applying different types of normalization and observing the effect on the resulting clusters. In my experience, there is not usually a major difference in these cases.
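A short sketch of the two-step approach described above (z-score first, then rescale into 0-1 so the values are comparable to 0/1 dummies); the income figures are hypothetical:

```python
import numpy as np

def z_score(x):
    """Standardize a numeric column to mean 0, standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def to_unit_range(x):
    """Rescale a column into [0, 1] (min-max)."""
    return (x - x.min()) / (x.max() - x.min())

income = np.array([20000.0, 35000.0, 50000.0, 65000.0, 80000.0])
z = z_score(income)            # roughly in [-3, 3] when there are no extreme outliers
z_rescaled = to_unit_range(z)  # now on the same 0-1 scale as the dummy variables
```

Because min-max scaling is a monotone linear map, the ordering and relative spacing of the z-scores are preserved; only the range changes.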
If you do have significant outliers, you might consider reviewing them carefully before trying to do the clustering because they are going to be problematic no matter which approach to normalization you choose.
Answers
-
Hi,
I would say this depends on the normalization. If you normalize the rest to the range between 0 and 1, you can keep them as is. Otherwise I would personally normalize all columns the same way (e.g. z-transformation).
Hope this helps,
Ingo
-
Hi,
I usually use PCA after dummy coding to get rid of the problem.
Best,
Martin
-
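A minimal sketch of the idea Martin describes — running PCA on the dummy-coded data and clustering on the component scores — using plain numpy via SVD (in practice a library implementation such as scikit-learn's PCA would be the usual choice); the data and the number of components are hypothetical:

```python
import numpy as np

def pca_transform(X, n_components):
    """Project the rows of X onto the top principal components (computed via SVD)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores in the principal-component space

# Mixed data: two numeric columns plus two dummy-coded columns.
rng = np.random.default_rng(0)
numeric = rng.normal(size=(10, 2))
dummies = rng.integers(0, 2, size=(10, 2)).astype(float)
X = np.hstack([numeric, dummies])

scores = pca_transform(X, n_components=2)        # cluster on these instead of the raw columns
```

One would then run the clustering (e.g. k-means) on `scores` and, as Martin notes below, join the original attributes back onto the cluster assignments for interpretation.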
@mschmitz but doesn't that get rid of your underlying attributes as well and replace them with synthetic PCs? That's probably not a helpful feature for clustering, or at least it wouldn't be for most of the clustering projects I have worked on.
-
@Telcontar120,
I later join the original data back to the clustering results and start interpreting from there.
BR,
Martin