Problems with Auto Model Cluster Analysis

Terpdog · May 2020

"I am using Auto Model to do a k-means cluster analysis. It works fine for 2 clusters. With 3 or more clusters, one or more clusters have an average distance of ? and a Davies-Bouldin index of infinity. This appeared before and I thought Version 9.6 had fixed it, but apparently not. It also appears in the beta of 9.7. Is there a way around this? Thanks."

lionelderkrikor · May 2020

Hi @Terpdog,

Can you share your data so that we can reproduce and understand what's going on?

Regards,

Lionel

Terpdog · May 2020

I am not sure what files are needed, but I have attached the only RapidMiner file I could find and also an Excel file of the data. I was using only the first four variables for the cluster analysis.

lionelderkrikor · May 2020

Hi @Terpdog,

Thank you for sharing your data.
I can reproduce what you observe:

Image: https://us.v-cdn.net/6030995/uploads/editor/p1/hjvgl6tpl47f.png

But something strange is happening in Auto Model itself: if I use your data (only the first four variables) with a k-Means model (k = 3, 4, etc.) in a classic RapidMiner process, the results are correct (i.e., I obtain finite values for the DB index and the average distances):

Image: https://us.v-cdn.net/6030995/uploads/editor/fp/v6ws8rkhgqmj.png

Does anyone have an idea of what's going on in Auto Model (clustering)?

The attached file contains the classic (working) RapidMiner process.

Regards,

Lionel
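
For readers who want to sanity-check the two reported measures outside RapidMiner, here is a minimal sketch in plain Python of the standard Davies-Bouldin definition (mean over clusters of the worst ratio of summed within-cluster scatter to centroid separation). The function names and the toy points are assumptions made for illustration; this is not RapidMiner's implementation, and it does not claim to explain the Auto Model behaviour. It only shows that, by definition, the index becomes infinite when two centroids coincide.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def avg_within_distance(points, center):
    """Average Euclidean distance from each point to its cluster centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of non-empty clusters (lists of points)."""
    centers = [centroid(c) for c in clusters]
    scatter = [avg_within_distance(c, m) for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i == j:
                continue
            separation = math.dist(centers[i], centers[j])
            # Coincident centroids give zero separation; treat the ratio as unbounded.
            ratio = math.inf if separation == 0 else (scatter[i] + scatter[j]) / separation
            worst = max(worst, ratio)
        total += worst
    return total / k

# Toy data (hypothetical): three well-separated clusters give a finite index.
separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)], [(0, 10), (2, 10)]]
db_separated = davies_bouldin(separated)

# Toy data (hypothetical): two clusters sharing the centroid (1, 0) give infinity.
degenerate = [[(0, 0), (2, 0)], [(1, -1), (1, 1)]]
db_degenerate = davies_bouldin(degenerate)

print(db_separated, db_degenerate)
```

Lower values indicate tighter, better-separated clusters, so a finite value for k = 3 in a hand-built process is what one would normally expect to see.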

Terpdog · May 2020

Thanks, Lionel. I did not think to try the process route. There has to be a bug in the Auto Model routine. Hopefully that can get fixed. There is still the question of why the distances are negative, which does not make sense.

lionelderkrikor · May 2020

@Terpdog,

The "real" distances are, of course, positive.
It seems to me that RapidMiner multiplies the distances by minus one (-1) so that it works with negative values, because
RapidMiner's algorithms seek to MAXIMIZE these values (explanation to be confirmed by the RM staff, @sgenzer?).

Regards,

Lionel
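
The sign convention described above is unconfirmed in this thread, but the general idea is easy to illustrate: if a framework always picks the highest score, a "smaller is better" measure can be reported as its negative. The sketch below is plain Python with hypothetical names and toy points; it assumes nothing about RapidMiner's internals and only shows that maximizing the negated average distance selects the same clustering as minimizing the real one.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def avg_distance_to_centroid(clusters):
    """Average Euclidean distance of every point to its own cluster centroid."""
    total, count = 0.0, 0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(math.dist(p, center) for p in cluster)
        count += len(cluster)
    return total / count

# Two candidate groupings of the same four toy points (hypothetical data).
candidates = {
    "A": [[(0, 0), (0, 1)], [(10, 0), (10, 1)]],   # compact clusters
    "B": [[(0, 0), (10, 0)], [(0, 1), (10, 1)]],   # stretched clusters
}

# Real distances are non-negative; negate them so that "higher is better".
scores = {name: -avg_distance_to_centroid(c) for name, c in candidates.items()}

# Picking the maximum score is the same as picking the minimum distance.
best = max(scores, key=scores.get)
print(scores, best)
```

Under such a convention, a reported value of -0.5 would correspond to a real average distance of 0.5.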

Terpdog · May 2020

That makes sense. I am continually frustrated at how hard it is to get routine statistics following an analysis in RapidMiner. I am trying to use this in my book, which talks about measures of fit in techniques such as cluster analysis, discriminant analysis, and logistic regression, and I can't get RapidMiner to produce them, or it is so difficult that it would be of no use to students. I may have to drop the idea of using it. Too bad.
