discretize by variance?

Hi. I have a database in which each row represents a person, and one of the columns is income. I tried applying k-means to group the data set. I first normalized the income column and also tried a log transform, but either way the results are not logical: people who are very dissimilar in terms of income end up in the same group. Income is not the only variable, but it is an important one. Because income has a very large coefficient of variation (1000%), I thought I could construct bins that each have a small coefficient of variation, e.g., up to 30%. After discretizing, I would convert the bins to numerical values so that they can be used by the k-means operator.

Can this be done in RapidMiner? Any ideas would help.

Accepted answers

omoratto

Brian, thank you so much for your feedback. I tried your suggested approach with normalization (not the z-transformation); unfortunately, it came up with two groups. What I decided to do was apply an outlier detection model before clustering. That way, I split the data set into two sections (outlier and non-outlier) and applied k-means to each section. It worked pretty well.

Thank you
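For readers who want to try the same idea outside RapidMiner, here is a minimal Python sketch of "split by outlier flag, then cluster each part". The post does not say which outlier detection model was used, so Tukey's 1.5 x IQR rule is an assumption here, and `iqr_outlier_flags`, `kmeans_1d`, and the income figures are hypothetical. It clusters the income column alone, whereas the real data set has more variables.

```python
import statistics

def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, an assumed detector)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on a single numeric column; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(statistics.fmean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Made-up incomes: nine ordinary rows plus two very large ones.
incomes = [22_000, 25_000, 31_000, 38_000, 45_000, 52_000,
           60_000, 75_000, 90_000, 2_500_000, 6_000_000]

flags = iqr_outlier_flags(incomes)
regular = [v for v, f in zip(incomes, flags) if not f]
extreme = [v for v, f in zip(incomes, flags) if f]

# Cluster each section separately, as described in the post.
regular_centroids, regular_labels = kmeans_1d(regular, k=2)
extreme_centroids, extreme_labels = kmeans_1d(extreme, k=2)
```

Because the two very large incomes are clustered on their own, they no longer drag a centroid away from the ordinary rows. The sketch assumes `k` is not larger than the number of rows in a section.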

All comments

Telcontar120

If you use a log-transform on income first then you should be able to Discretize by Binning (based on whatever log-range you want to allow in each bin). Otherwise, you can simply Discretize by User Specification, if you know exactly where you want the cutpoints to appear based on your particular income distribution.

omoratto

Thanks for your answer. The issue is that I want to discretize income directly, because with the log transform k-means groups individuals with very large incomes (e.g., 6 million) together with low-income individuals (e.g., 60k). I want to bin income so that each bin has a low coefficient of variation (VC), e.g., < 30%, and to do it directly in RapidMiner.

Or is there another way to accomplish this?

Thanks.

MartinLiebig

Hi,

I am not sure how this should work with variance. I mean, the variance of higher values is naturally bigger, isn't it? Usually you take other measures into account. Did you have a look at Discretize by Entropy?

~Martin

omoratto

Thanks for your answer. Precisely because variance grows with the values, I was thinking of using the coefficient of variation (VC = sigma/|mean|), since it is expressed as a percentage of the mean. I will try Discretize by Entropy as you suggested and get back to you. Thanks a lot.
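To make the idea concrete, the following is a rough Python sketch of binning by coefficient of variation. It is not a RapidMiner operator; the greedy strategy, the function names, and the sample incomes are all assumptions for illustration. It sorts the values and keeps extending the current bin until adding the next value would push sigma/|mean| above the limit.

```python
import statistics

def coefficient_of_variation(values):
    """VC = population standard deviation divided by |mean| (assumes mean != 0)."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / abs(mean)

def cv_bounded_bins(values, max_cv=0.30):
    """Greedily grow bins over the sorted values; close a bin before its VC exceeds max_cv."""
    bins, current = [], []
    for v in sorted(values):
        candidate = current + [v]
        if current and coefficient_of_variation(candidate) > max_cv:
            bins.append(current)   # adding v would break the limit, so close the bin
            current = [v]          # and start a new one with v
        else:
            current = candidate
    if current:
        bins.append(current)
    return bins

# Made-up incomes, for illustration only.
incomes = [20_000, 25_000, 30_000, 60_000, 70_000, 2_000_000, 6_000_000]
bins = cv_bounded_bins(incomes)
# bins is [[20000, 25000, 30000], [60000, 70000], [2000000], [6000000]]
```

The bin position (0, 1, 2, ...) or the bin mean could then serve as the numerical value passed to k-means. A greedy pass like this is only one possible strategy and does not guarantee the fewest bins.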

omoratto

I tried Discretize by Entropy, but it generated only two categories, low income and high income, so it is not useful.

Telcontar120

First, you mentioned that you didn't get good results when you tried to Normalize. What method of normalization did you select for income when you ran that operator? If you have outliers or a very skewed distribution, then the range or proportional sum methods can be easily distorted. You might want to rerun using interquartile range and then see whether that improves things. Z-transformation is also more robust against outliers but it might not fit your distribution that well.
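As a rough illustration of why the choice matters, the Python sketch below rescales one made-up income column three ways. The formulas are textbook versions and are assumptions here (in particular, the interquartile variant is taken as (value - median) / IQR); RapidMiner's Normalize operator may differ in detail. Each method is an affine rescaling of the column, so none of them changes the shape of the distribution. What changes is how many units the ordinary rows occupy compared with other attributes scaled the same way.

```python
import statistics

def range_scale(values):
    """Min-max scaling to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scale(values):
    """Z-transformation: subtract the mean, divide by the standard deviation."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def iqr_scale(values):
    """Assumed IQR scaling: subtract the median, divide by (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

def spread(scaled, count):
    """Width occupied by the first `count` rows after scaling."""
    head = scaled[:count]
    return max(head) - min(head)

# Made-up incomes: nine ordinary rows followed by two very large ones.
incomes = [22_000, 25_000, 31_000, 38_000, 45_000, 52_000,
           60_000, 75_000, 90_000, 2_500_000, 6_000_000]

range_spread = spread(range_scale(incomes), 9)  # ordinary rows squeezed near 0
z_spread = spread(z_scale(incomes), 9)          # still a small fraction of one unit
iqr_spread = spread(iqr_scale(incomes), 9)      # ordinary rows span more than one unit
```

With range scaling, the two extreme incomes set the maximum, so the nine ordinary rows end up almost indistinguishable to a distance-based method such as k-means.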

Next, if you do a log (base 10) transformation of your incomes, you should absolutely be able to specify equal-width bins that will accommodate your desire to have a maximum proportional income range within each bin. If we are talking about annualized numbers, income is typically going to range from the tens of thousands up to perhaps the millions, which spans only about three orders of magnitude, so your log values will mostly be between 4 and 7. If you selected a bin size of 0.2 (on the log scale), this would ensure that within any given bin the variance percentage (sigma/mean) was not more than approximately 30%. (Check it out on a spreadsheet, it's just math!)
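For anyone who prefers code to a spreadsheet, here is a small Python check of that arithmetic. The function names and the bin origin at log10 = 4 are assumptions for illustration. For a bin covering [a, r·a] with r = 10^width, the largest possible sigma/mean works out to (r - 1) / (2·sqrt(r)); this follows from the Bhatia-Davis bound on variance and is reached when all the mass sits on the two bin edges.

```python
import math

def worst_case_cv(log10_width):
    """Largest sigma/mean any set of positive values can have inside one log10 bin."""
    ratio = 10 ** log10_width              # upper edge divided by lower edge
    return (ratio - 1) / (2 * math.sqrt(ratio))

def log10_bin_index(income, log10_width=0.2, log10_start=4.0):
    """Index of the equal-width log10 bin containing `income` (bin 0 starts at 10**log10_start)."""
    return math.floor((math.log10(income) - log10_start) / log10_width)

bound = worst_case_cv(0.2)               # about 0.23, safely under 0.30
low_bin = log10_bin_index(60_000)        # log10 is about 4.78, so bin 3
high_bin = log10_bin_index(6_000_000)    # log10 is about 6.78, so bin 13
```

So a width of 0.2 keeps the worst case near 23%, and an income of 60k lands ten bins away from an income of 6 million. Real data inside a bin is usually spread out rather than piled on the edges, so the observed value will typically be lower than the bound.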

And as I mentioned, if you want your bins not to be equal in width for whatever reason, you can still use the Discretize by User Specification to simply create whatever bins you think are most appropriate for the actual distribution that you have.

Finally, is there any reason why you believe that 30% is a critical number when it comes to income variance? It seems like that is a fairly arbitrary threshold that you have defined. Perhaps you should look at a more data-driven mechanism to try to determine how income should affect the final model?

omoratto