Clustering accuracy & how can I pick the proper number of clusters
marou_mal96
New Altair Community Member
Hello there! I have two questions about clustering. The first is about the number of clusters: I have only numerical attributes and I don't know what the best number of clusters is for my k-means clustering. The other question is whether there is any way to measure accuracy other than "Map clustering on labels".
Thanks in advance!
Answers
-
What you need is to set up an experiment. Use Optimize Parameters (Grid) to vary the number of clusters in k-means and log the cluster performance measures. Inside it you will need k-means and some cluster performance measure, typically Davies-Bouldin (closest to zero is best), which can be obtained from Cluster Distance Performance, or Sum of Squares from Cluster Distribution Performance. The DB measure works well when your attributes are numerical and smooth and the clusters are convex. Once you have collected a log of k vs DB performance, plot it and find the DB closest to zero, ideally in a smooth, stable segment of the plot; this will be around the optimum k. However, DB often fails that stability test. In that case the plot of k vs Sum of Squares (average distance from cluster centres) supports a nice informal method called the elbow method: you look for the k beyond which the gain in performance (best SOS) is no longer significant compared to the added clustering complexity (k). On the plot it often looks like the tip of an elbow.
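To make the experiment concrete outside RapidMiner, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data (three tight groups of points), the helper names, the deterministic farthest-first initialisation (used so the sweep has no random effects), and the textbook Davies-Bouldin formula, whose sign and scaling may differ from what the RapidMiner operators report.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty cluster keeps its previous centroid.
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)

def sum_of_squares(points, centroids, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

def davies_bouldin(points, centroids, labels):
    """Textbook Davies-Bouldin index: lower (closer to zero) is better."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        d = [math.dist(p, centroids[j]) for p, l in zip(points, labels) if l == j]
        scatter.append(sum(d) / len(d) if d else 0.0)
    total = 0.0
    for i in range(k):
        total += max(((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                      for j in range(k) if j != i and centroids[i] != centroids[j]),
                     default=0.0)
    return total / k

# Toy data (made up): a 3x3 grid of points around each of three centres.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (20, 0), (8, 16)]
          for dx in offsets for dy in offsets]

results = {}
for k in range(2, 7):
    centroids, labels = kmeans(points, k)
    results[k] = (davies_bouldin(points, centroids, labels),
                  sum_of_squares(points, centroids, labels))
    print(f"k={k}  DB={results[k][0]:.3f}  SOS={results[k][1]:.2f}")

best_k = min(results, key=lambda k: results[k][0])
print("k with DB closest to zero:", best_k)
```

The printed k vs DB and k vs SOS columns are what you would plot; on this toy data the DB minimum and the elbow of the SOS curve fall on the same k.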
-
I find mapping of clusters on labels unreliable, especially when your clustering is not very good. One similar method is to combine k-means with k-NN to determine how well the cluster system can "predict" the cluster from the neighbour distances, and then measure the accuracy of this process. However, when you consider what is important in clustering, i.e. all similar data points should be close to each other (as well as to their cluster centroid) and far away from dissimilar ones (and from the centroids of other clusters), the other performance measures are more appropriate. It is also a good idea to use PCA to map your data into 2D and then plot your data coloured by cluster to determine whether the clusters are cohesive and well separated.
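One possible reading of the k-means plus k-NN idea, sketched in plain Python: for every point, let its k nearest neighbours (excluding the point itself) vote on its cluster and count how often the vote agrees with the assigned cluster. The function name `knn_agreement`, the toy points and both label lists are invented for illustration; this is not a RapidMiner operator.

```python
import math
from collections import Counter

def knn_agreement(points, labels, k=3):
    """Leave-one-out share of points whose k nearest neighbours vote for their own cluster."""
    hits = 0
    for i, p in enumerate(points):
        # Distances to all other points, paired with those points' cluster labels.
        others = sorted((math.dist(p, q), labels[j])
                        for j, q in enumerate(points) if j != i)
        vote = Counter(label for _, label in others[:k]).most_common(1)[0][0]
        hits += vote == labels[i]
    return hits / len(points)

# Toy data (made up): two compact groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # clusters follow the two groups
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]    # clusters ignore the geometry

good = knn_agreement(points, good_labels)
bad = knn_agreement(points, bad_labels)
print("agreement for the sensible clustering:", good)
print("agreement for the poor clustering:", bad)
```

A cohesive clustering scores close to 1, while a clustering that cuts through dense regions scores lower.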
-
One more warning: when you plot cluster performance, make sure that no random effects influence the process, e.g. the clustering algorithm is affected by the initial position of the cluster centroids. So set the random seed of any operator which has a random element. Otherwise you will not know whether a clustering improvement is due to the optimum k or to a random effect. The random effect will usually show in your plot as an up-and-down zigzag.
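The effect of the seed can be illustrated with a small sketch in plain Python. It assumes a bare-bones one-dimensional k-means with random initial centroids and made-up data; `kmeans_sos` is a hypothetical helper, not the RapidMiner implementation.

```python
import random

def kmeans_sos(values, k, seed, iters=50):
    """1-D k-means with randomly chosen initial centroids; returns the final sum of squares."""
    rng = random.Random(seed)          # the seed fully determines the initial centroids
    centroids = rng.sample(values, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Toy data (made up): three loose groups on a line.
data = [0.8, 1.0, 1.2, 1.5, 4.6, 5.0, 5.1, 5.4, 8.7, 9.0, 9.2, 9.6]

# Same seed, same k: the result is repeatable.
first = kmeans_sos(data, 3, seed=7)
second = kmeans_sos(data, 3, seed=7)
print("seed 7 twice:", first, second)

# Different seeds, same k: any variation here is a random effect, not an effect of k.
for seed in range(5):
    print("seed", seed, "SOS", round(kmeans_sos(data, 3, seed), 3))
```

With the seed fixed, any change in the performance curve can only come from changing k.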
-
How can I use PCA?
At the moment I have this process. Where can I put the PCA?
-
Place it after your clustering and apply it to the clustered examples (it can be a separate process), then scatter plot PC1 vs PC2 and use the cluster as colour. You can also extract the coordinates of the centroids from your cluster model using Extract Cluster Prototypes and plot them in the same PCA coordinate system as the rest of the data points (simply apply that PCA model to the centroids and plot them separately). In this way you'll see whether the cluster centres are well separated.
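The same idea as a minimal plain-Python sketch: fit the PCA on the clustered examples, then push both the examples and the centroids through that one model so they share a coordinate system. The PCA here is a simple power-iteration version, and the data, labels and helper names (`fit_pca`, `transform`) are all assumptions for illustration; the cluster means stand in for what Extract Cluster Prototypes would return.

```python
def fit_pca(rows, n_components=2, iters=200):
    """Fit PCA by power iteration with deflation; returns (mean, components)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [1.0] * d
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        # Eigenvalue estimate, then remove this direction from the covariance.
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return mean, components

def transform(model, rows):
    """Project rows with an already fitted model (same mean, same components)."""
    mean, components = model
    return [tuple(sum((r[j] - mean[j]) * c[j] for j in range(len(mean)))
                  for c in components) for r in rows]

# Toy clustered examples (made up): three clusters in 3-D, four points each.
centres = [(0, 0, 0), (10, 5, 1), (0, 6, -1)]
offsets = [(0.3, 0.0, 0.1), (-0.3, 0.2, 0.0), (0.0, -0.3, -0.1), (0.0, 0.1, 0.0)]
examples = [tuple(c + o for c, o in zip(centre, off))
            for centre in centres for off in offsets]
labels = [i for i in range(len(centres)) for _ in offsets]

# Centroids as cluster means.
centroids = [tuple(sum(col) / len(col)
                   for col in zip(*[e for e, l in zip(examples, labels) if l == k]))
             for k in range(len(centres))]

model = fit_pca(examples)                  # built on the many examples
examples_2d = transform(model, examples)   # PC1/PC2 of every example
centroids_2d = transform(model, centroids) # same model applied to the centroids

for k, (pc1, pc2) in enumerate(centroids_2d):
    print(f"centroid {k}: PC1={pc1:.2f}  PC2={pc2:.2f}")
```

`examples_2d` (coloured by `labels`) and `centroids_2d` can then be drawn on the same scatter plot with whatever plotting tool you prefer.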
-
One last piece of advice: keep your k practical. Rather than finding the global optimum for the number of clusters, you may often prefer to find the best k within a range. For example, if you are conducting customer segmentation for a marketing campaign, you may not be able to afford more than 10 separate campaigns. It is then not useful to know that the best number of clusters is 76, but it is practical to know that the best number of clusters up to 10 is 5.
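In code, restricting the search is just a filter on the logged results. The sketch below assumes a dictionary of k to DB values; the numbers are invented purely to mirror the 76 vs 5 example, and `best_k_within` is a hypothetical helper.

```python
def best_k_within(db_by_k, max_k):
    """Return the k with the lowest DB value among all k <= max_k."""
    candidates = {k: db for k, db in db_by_k.items() if k <= max_k}
    if not candidates:
        raise ValueError("no logged k is within the allowed range")
    return min(candidates, key=candidates.get)

# Invented DB values for illustration only (lower is better).
db_by_k = {2: 0.91, 3: 0.74, 5: 0.52, 8: 0.66, 10: 0.70, 40: 0.48, 76: 0.41}

global_best = min(db_by_k, key=db_by_k.get)
practical_best = best_k_within(db_by_k, max_k=10)
print("global optimum:", global_best)          # not affordable as 76 campaigns
print("best k up to 10:", practical_best)
```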
-
How do you like it?
-
With 3 clusters I get the best DB value.
-
What I'd do is build the PCA using the clustered examples and then apply the resulting PCA model to the centroids extracted from the cluster model. This way the PCA is built on lots of data and is more reliable.
-
Is there a tutorial about this, or anything else that could help me build what you described? I am a beginner in RapidMiner and I cannot understand much of what you said. Thank you again for your time, sir!
-
I am not sure how urgent your project is. I am planning to continue recording my YouTube videos (check ironfrown) and can record a mini series on cluster analysis in RapidMiner in January. In the meantime, I strongly suggest getting the book by Vijay Kotu and Bala Deshpande, Data Science: Concepts and Practice, 2nd Edition, where chapter 7 describes cluster analysis in RapidMiner (yes, the whole book uses RapidMiner to explain the different examples).
-
Thank you very much, sir! I appreciate it. Merry Christmas 🌲