Cluster basketball players based on their performance
LilC
New Altair Community Member
I have a dataset with many players and their performance stats for the season.
My goal is to cluster them into 3 or more groups based on their performance, e.g. high, average, and low performers.
The attributes include position, average points, steals, mistakes, blocks, running distance, etc.
I guess the analysis will involve k-means, but I don't think I'll need all the attributes for the clustering. The other task is to find out which few attributes can be used to split the players.
I am still very new to RapidMiner, and thanks for all the help from you guys.
If anyone can point me in the right direction, that would be great. I am open to any extensions.
Thanks.
Answers
If you want to use k-means, you'll need numerical attributes. Make sure that you select attributes that are independent of each other. Although k-means is not a linear model, you could use the Correlation Matrix operator to establish the independence of attributes: ignore the matrix itself and look at the weights output. The higher the weight, the more (linearly) independent an attribute is of the others, and vice versa. While there are many other ways of weighting attributes, one great thing about doing it this way is that you do not need to define a label in this process (we are not predicting anything).
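Outside RapidMiner, the same idea (drop one attribute from each highly correlated pair, then run k-means on the survivors) can be sketched in Python with scikit-learn. The stat names, the synthetic data, and the 0.8 correlation threshold below are all hypothetical, chosen only to illustrate the workflow:

```python
# Sketch: correlation-based attribute screening before k-means.
# All column names, data, and the 0.8 threshold are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-player season stats (numerical only, as k-means requires)
rng = np.random.default_rng(0)
stats = pd.DataFrame({
    "avg_points": rng.normal(12, 5, 100),
    "steals": rng.normal(1.2, 0.5, 100),
    "blocks": rng.normal(0.8, 0.4, 100),
    "running_distance": rng.normal(4.0, 1.0, 100),
})
# Make "mistakes" strongly correlated with "avg_points" on purpose
stats["mistakes"] = stats["avg_points"] * 0.1 + rng.normal(0, 0.2, 100)

# Drop one attribute from each highly correlated pair
corr = stats.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
selected = stats.drop(columns=to_drop)

# Standardize and cluster into 3 groups (high / average / low performers)
X = StandardScaler().fit_transform(selected)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("dropped:", to_drop)
print("cluster sizes:", np.bincount(labels))
```

Note that, as in the RapidMiner process, no label is needed anywhere: both the correlation screen and the clustering are unsupervised.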
Thanks again for the explanation and all the help. One more thing: after I used k-means, I saw a video showing that Cluster Distance Performance can be used to evaluate the clustering. Is there a rule of thumb for 'Avg. within centroid distance' or 'Davies Bouldin', like the ones for correlation coefficients?
Or does the result need to be below 1 to make the clustering a 'good' one?
The best use of cluster performance measures is in optimisation, searching for the best cluster parameters, e.g. with grid optimisation; a single performance measure on its own will not be useful. Davies-Bouldin is very tricky and I never had much success with it. For DB to work your data must be smooth and convex (smooth as in continuous, and convex in multidimensional space), which is hard to imagine and hard to achieve on real data. If you use DB, decide on the range of k that is acceptable for you and pick the DB value closest to zero in that range, while avoiding peaks and troughs near the minimum (so go for the flat areas around it). I often use Item Distribution Performance and select SumOfSquares as the measure (a sort of cluster error from its centre). Then plot SSE vs k and look for the elbow, i.e. the point where increasing the clustering complexity, as given by k, no longer gives any significant gain in performance.
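The elbow search described above can be sketched in Python, using scikit-learn's `inertia_` (the within-cluster sum of squares, i.e. SSE) in place of RapidMiner's Item Distribution Performance. The synthetic data and the range of k are illustrative assumptions:

```python
# Sketch of the elbow method: compute SSE for a range of k and look for
# the point where the curve flattens. Data here is synthetic (3 blobs).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # within-cluster sum of squared distances

for k, v in sse.items():
    print(f"k={k}  SSE={v:.1f}")
# The 'elbow' is where SSE stops dropping sharply; with 3 true blobs
# that is typically at k=3.
```

In a real process you would plot SSE against k (e.g. with matplotlib) rather than eyeballing the printout, but the decision rule is the same: pick the k beyond which extra clusters buy little reduction in SSE.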