I am clustering text from a discussion forum using k-means. I followed the sample process "09_KMeansWithPlot" (thanks Ingo!) to determine the optimum number of clusters using two measures: (W) Avg Within Cluster Distance and (DB) the Davies-Bouldin Index.
My understanding is that the DB index "is a function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between cluster (i.e. intercluster) scatter. A good value for the number of clusters is associated to lower values of this index."
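To check that I am reading that definition correctly, here is a minimal sketch of how I understand the index, written in plain Python rather than as a RapidMiner process. The function name `davies_bouldin`, the use of Euclidean distance, and returning infinity when two centroids coincide are my own assumptions for illustration; I do not know how RapidMiner computes or reports the value internally.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index for a list of points and their cluster labels.

    Illustrative sketch only: Euclidean distance is assumed, and a zero
    distance between two centroids is reported as infinity by choice.
    """
    # Group the points by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids = {}
    scatter = {}
    for label, members in clusters.items():
        # Centroid: per-dimension mean of the cluster's points.
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        centroids[label] = centroid
        # Within-cluster scatter: mean distance of the points to the centroid.
        scatter[label] = sum(math.dist(p, centroid) for p in members) / len(members)

    worst_ratios = []
    for i in clusters:
        ratios = []
        for j in clusters:
            if i == j:
                continue
            separation = math.dist(centroids[i], centroids[j])
            if separation == 0:
                # Assumption of this sketch: coincident centroids give infinity.
                ratios.append(math.inf)
            else:
                ratios.append((scatter[i] + scatter[j]) / separation)
        # Each cluster contributes its worst (largest) ratio.
        worst_ratios.append(max(ratios))

    # The index is the average of those worst-case ratios; lower is better.
    return sum(worst_ratios) / len(worst_ratios)
```

For example, with the points (0,0), (0,2), (10,0), (10,2) labelled 0, 0, 1, 1, each cluster has a scatter of 1 and the centroids are 10 apart, so this sketch returns (1 + 1) / 10 = 0.2.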
That being said, I am having trouble interpreting my results:
- Why are some of my DB values negative infinity?
- Some of my DB graphs have a gentle negative slope with no obvious "elbow" in the trend line. How do I know where the optimum number of clusters is in that case?
- Why do some of the charts only plot certain cluster counts? For example, the x-axis shows 2, 12, 22, etc. instead of every value (1, 2, 3, ... 22, etc.).
- Are there any rules of thumb I should keep in mind when using the DB index with text data?
Thanks Rapidminers!