K-means cluster with text data

User: "joen841030"
New Altair Community Member
Updated by Jocelyn
Hello experts! 

I'd like to run k-means clustering on text data. My data is saved in one Excel file with a single column and one word in each cell. I'm not sure whether I am doing it correctly (picture attached), because the output looks like the listing below, with cluster 3 containing 4889 items:

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997



Also, I wonder whether it is possible to use something like silhouette scores to determine the ideal number of clusters? Thank you!!!
    User: "lionelderkrikor"
    New Altair Community Member
    Accepted Answer
    Hi @joen841030,

    No, the average within centroid distance of cluster i is not limited to the range -1 to +1.
    It is a distance measure (for example, the Euclidean distance for numeric attributes) between the points of cluster i and the centroid of cluster i, so it quantifies how compact or dense a cluster is. As a distance, it would normally lie between 0 and +infinity. In RapidMiner, however, it lies between -infinity and 0, because the metric is multiplied by -1 so that RapidMiner can treat it as a value to maximize.
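To make the range concrete, here is a minimal sketch in plain Python, not RapidMiner's own implementation. It assumes the plain Euclidean distance described above (the operator may differ in details, for instance by using squared distances). The function name `avg_within_centroid_distance` and the two toy clusters are made up for illustration.

```python
import math

def avg_within_centroid_distance(points, negate=True):
    """Average Euclidean distance from each point to its cluster centroid.

    With negate=True the result is multiplied by -1, mirroring the sign
    convention described above (assumption: plain, non-squared distance).
    """
    n = len(points)
    dims = len(points[0])
    # The centroid is the per-dimension mean of the cluster's points.
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    avg = sum(math.dist(p, centroid) for p in points) / n
    return -avg if negate else avg

# Two hypothetical clusters: one compact, one spread out.
tight = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
spread = [(0.0, 0.0), (0.0, 20.0), (20.0, 0.0), (20.0, 20.0)]

tight_score = avg_within_centroid_distance(tight)
spread_score = avg_within_centroid_distance(spread)
print(tight_score, spread_score)
```

Both values are negative and neither is bounded by -1: the compact cluster scores about -1.41 and the spread-out one about -14.14. With the negated convention, a value closer to 0 means a denser cluster.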

    Here is a resource about average within-cluster distance:

    https://rapidminernotes.blogspot.com/2011/04/how-average-within-cluster-distance-is.html  

    Hope this helps,

    Regards,

    Lionel