interpreting the sum of TF-IDF scores of words across documents
LindsayKelevra
New Altair Community Member
Hi guys! After clustering a list of documents with k-means, I would like to analyze the words in each cluster (in order to correlate them with other attributes). To do this I summed the TF-IDF values for each word, but I think this approach may be wrong. Would it be more correct to use term frequency? Thanks in advance.
Answers
-
Hi, I am not sure what exactly you are asking. Can you elaborate a bit? Also, LDA might be something for you: it usually performs better at detecting groups in texts.
Best,
Martin
-
Hi! I ran k-means clustering on an attribute that contains one article per record. After applying TF-IDF, I now have a matrix of words and their relative frequencies. Now I'm trying to analyze the words contained in each cluster. Since I have many attributes, is it valid to sum the TF-IDF values for each word? Alternatively, I thought of using the average; would that be more correct?
-
Hi @LindsayKelevra,
this is what I usually do to understand my clusters: https://towardsdatascience.com/understanding-clustering-cf0117148ef4#b7ae
That should also work on TF-IDF.
~Martin
-
Fundamentally, you probably don't want to add TF-IDF values, since TF-IDF is not designed to be additive (e.g., it doesn't have consistent scaling, because each term frequency is multiplied by the log of the inverse document frequency).
If you want to use your word vector values directly, you should use one of the metrics that is inherently additive, such as term occurrences, which is just a raw count of terms, or term frequency, which is just the unadjusted percentage of total terms that a particular term covers.
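As a rough illustration outside of any particular tool, here is a minimal Python sketch of summing term occurrences per cluster and turning them into term frequencies. The toy documents, the `labels` list, and both function names are made up for this example; in practice the labels would come from your k-means output.

```python
from collections import Counter

# Hypothetical toy data: tokenized documents and the cluster label
# that k-means assigned to each one (same order as docs).
docs = [
    ["stock", "market", "falls"],
    ["market", "rally", "stock", "stock"],
    ["team", "wins", "match"],
    ["match", "ends", "team", "draw"],
]
labels = [0, 0, 1, 1]


def cluster_term_counts(docs, labels):
    """Sum raw term occurrences over all documents in each cluster."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def cluster_term_frequencies(counts):
    """Convert per-cluster counts into each term's share of all terms in that cluster."""
    freqs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        freqs[label] = {term: n / total for term, n in counter.items()}
    return freqs


counts = cluster_term_counts(docs, labels)
freqs = cluster_term_frequencies(counts)

# Show the most common terms per cluster.
for label in sorted(counts):
    print(label, counts[label].most_common(3))
```

Because raw counts are additive, the per-cluster totals mean what you would expect, and the term frequencies within each cluster sum to 1.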
But I also agree with Martin that this is not the most intuitive way to understand your clusters. You can use some of the methods he describes, or you can simply look at the centroid values directly (one of the outputs of the cluster operators) and find the values that differ most from one cluster to another (the graph visualization is helpful for this).
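If you export the centroid table, the comparison can be sketched in a few lines of Python. This is only one possible way to define "most distinct" (the gap between a cluster's centroid value and the mean of the other centroids); the `terms` list, the centroid numbers, and the function name are invented for illustration.

```python
# Hypothetical centroid table: one row per cluster, one value per term.
terms = ["stock", "market", "team", "match", "the"]
centroids = {
    0: [0.42, 0.35, 0.01, 0.02, 0.10],
    1: [0.02, 0.03, 0.40, 0.38, 0.11],
}


def distinctive_terms(centroids, terms, top_n=2):
    """For each cluster, rank terms by how far its centroid value lies
    above the mean centroid value of all other clusters.
    Assumes there are at least two clusters."""
    result = {}
    for label, values in centroids.items():
        others = [v for k, v in centroids.items() if k != label]
        gaps = []
        for i, term in enumerate(terms):
            other_mean = sum(o[i] for o in others) / len(others)
            gaps.append((values[i] - other_mean, term))
        gaps.sort(reverse=True)  # largest positive gap first
        result[label] = [term for _, term in gaps[:top_n]]
    return result


top_terms = distinctive_terms(centroids, terms)
print(top_terms)
```

A term like "the" has similar values in both centroids, so it drops out of the ranking even though its absolute weight is not small.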