K-means cluster with text data

joen841030
joen841030 New Altair Community Member
edited November 5 in Community Q&A
Hello experts! 

I'd like to run k-means clustering on text data. My data is saved in a single Excel file with only one column and one word in each cell. I'm not sure whether I am doing it correctly (picture attached), because the output looks like the one below, with cluster 3 containing 4889 items??

Cluster 0: 20 items
Cluster 1: 18 items
Cluster 2: 20 items
Cluster 3: 4889 items
Cluster 4: 20 items
Cluster 5: 10 items
Cluster 6: 10 items
Cluster 7: 10 items
Total number of items: 4997



Also, I wonder whether it is possible to use something like Silhouette scores to determine the ideal number of clusters? Thank you!!!

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @joen841030,

    Here you can find a method for finding the optimal number of clusters k, based on calculating the Average within Centroid Distance as a function of k (the number of clusters):

    https://community.rapidminer.com/discussion/comment/61654#Comment_61654

    Hope this helps,

    Regards,

    Lionel
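
To make the idea behind the linked method more concrete, here is a minimal sketch in plain Python (not a RapidMiner process). It runs a small k-means for several values of k and records the average distance of each point to its cluster centroid. Everything in it is an assumption made for illustration: the toy 2-D points, the helper names `kmeans` and `avg_centroid_distance`, the use of plain Euclidean distance, and the deterministic seeding with the first k points. A common way to read such a curve is to look for the "elbow", the k after which the distance stops dropping sharply.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids (illustrative choice)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

def avg_centroid_distance(points, labels, centroids):
    """Average Euclidean distance between each point and its own cluster centroid."""
    return sum(math.dist(p, centroids[lab]) for p, lab in zip(points, labels)) / len(points)

# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (5, 5), (10, 0), (0, 1), (5, 6), (10, 1), (1, 0), (6, 5), (11, 0)]

scores = {}
for k in range(1, 6):
    labels, centroids = kmeans(points, k)
    scores[k] = avg_centroid_distance(points, labels, centroids)

for k, score in scores.items():
    print(f"k={k}: average within-centroid distance = {score:.3f}")
```

On this toy data the distance falls steeply up to k=3 and only slightly afterwards, which is the kind of bend one would look for. Note that the sketch reports the distance as a positive number; as explained further down the thread, RapidMiner reports it with a negative sign.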
  • joen841030
    joen841030 New Altair Community Member
    Hi @lionelderkrikor
    Thanks for the reply! Hmm... now I get the results below, but they don't look correct to me...

    PerformanceVector:
    Avg. within centroid distance: -385.889
    Avg. within centroid distance_cluster_0: -393.196
    Avg. within centroid distance_cluster_1: -351.386
    Avg. within centroid distance_cluster_2: -410.075
    Avg. within centroid distance_cluster_3: -384.852
    Avg. within centroid distance_cluster_4: -403.787
    Avg. within centroid distance_cluster_5: -371.171
    Avg. within centroid distance_cluster_6: -366.001
    Avg. within centroid distance_cluster_7: -402.358
    Davies Bouldin: -0.500

    I have also now included "Nominal to Numerical"... am I actually doing it correctly? I was just following different online tutorials and trying to figure out how to do it...

    Thanksss so much in advance!




  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @joen841030,

    Why do you think that these results are incorrect?

    Regards,

    Lionel
  • joen841030
    joen841030 New Altair Community Member
    Hi @lionelderkrikor,
    Hmm, because I presumed the value should be somewhere between -1 and +1? Sorry, I don't understand those figures... It would be nice if you could kindly explain them. Thanks!
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Answer ✓
    Hi @joen841030,

    No, the average within centroid distance of cluster i is not bounded between -1 and +1.
    It is a measure of distance (for example, the Euclidean distance for numeric attributes) between the points of cluster i and the centroid of cluster i, so it quantifies how "compact" or "dense" a cluster is.
    In general this metric lies between 0 and +infinity, but in RapidMiner it lies between -infinity and 0: the value is multiplied by minus one because RapidMiner tries to maximize this metric.

    Here is a resource about the average within-cluster distance:

    https://rapidminernotes.blogspot.com/2011/04/how-average-within-cluster-distance-is.html  

    Hope this helps,

    Regards,

    Lionel
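
As a rough illustration of the explanation above, the following plain-Python sketch computes an average within-centroid distance per cluster and overall, then multiplies it by minus one to mirror the sign convention described for RapidMiner. The function name `avg_within_centroid_distance`, the toy points, and the choice of plain Euclidean distance are assumptions for illustration only; RapidMiner's exact computation may differ, so see the linked resource for details.

```python
import math

def avg_within_centroid_distance(points, labels):
    """Return (overall, per_cluster) average Euclidean distance to the cluster centroid.

    Values are negated to mirror the sign convention described in the answer above.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    per_cluster = {}
    total = 0.0
    for lab, members in clusters.items():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        dists = [math.dist(m, centroid) for m in members]
        total += sum(dists)
        per_cluster[lab] = -sum(dists) / len(members)

    overall = -total / len(points)
    return overall, per_cluster

# Hypothetical toy data: a tight cluster (0) and a more spread-out cluster (1).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 6.0)]
labels = [0, 0, 1, 1]

overall, per_cluster = avg_within_centroid_distance(points, labels)
print(overall)      # -2.0
print(per_cluster)  # {0: -1.0, 1: -3.0}
```

The values scale with the units of the data, so they can easily fall outside the range -1 to +1. The more spread-out cluster gets the more negative number, and values closer to 0 indicate more compact clusters.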