interpreting the sum of TF-IDF scores of words across documents
LindsayKelevra
Hi guys! After clustering a list of documents with k-means, I would like to analyze the words in each cluster (in order to correlate them with other attributes). To do this I added up the TF-IDF value for each word, but I think this solution may be wrong. Would it be more correct to use term frequency? Thanks in advance.
MartinLiebig
Hi,
I am not sure what exactly you are asking. Can you elaborate a bit?
Also, maybe LDA is something for you. It usually performs better at detecting groups in texts.
Best,
Martin
LindsayKelevra
Hi! I ran k-means clustering on an attribute that contains one article per record. Having used TF-IDF, I now have a matrix of words and their relative frequencies. I'm now trying to analyze the words contained in each cluster. Since I have many attributes, is it possible to sum the TF-IDF values for each word? Alternatively, I thought of using the average; is that more correct?
MartinLiebig
Hi @LindsayKelevra,
this is what I usually do to understand my clusters:
https://towardsdatascience.com/understanding-clustering-cf0117148ef4#b7ae
That should also work on TF-IDF.
~Martin
Telcontar120
Fundamentally, you probably don't want to add TF-IDF values, since TF-IDF is not designed to be additive (e.g., it doesn't have consistent scaling because the term frequency is multiplied by the log of the inverse document frequency).
If you want to use your word vector values directly, you should use one of the metrics that is inherently additive, such as term occurrences, which is just a raw count of terms, or term frequency, which is just the unadjusted percentage of total terms that a particular term covers.
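A minimal Python sketch of the additive approach, outside AI Studio, in case it helps to see the arithmetic. It assumes whitespace tokenization and a list of cluster labels already produced by k-means; the function names and the toy documents are hypothetical and only for illustration.

```python
from collections import Counter, defaultdict

def term_counts_by_cluster(docs, labels):
    """Sum raw term occurrences per cluster (raw counts are additive)."""
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        # Assumption: naive lowercase + whitespace tokenization.
        counts[label].update(text.lower().split())
    return counts

def term_frequency_by_cluster(counts):
    """Turn per-cluster counts into each term's share of the cluster's tokens."""
    freqs = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        freqs[label] = {term: n / total for term, n in counter.items()}
    return freqs

# Hypothetical toy data: three short documents and their cluster labels.
docs = ["rates rise again", "bank rates fall", "team wins cup"]
labels = [0, 0, 1]

counts = term_counts_by_cluster(docs, labels)
freqs = term_frequency_by_cluster(counts)

for label in sorted(counts):
    print(label, counts[label].most_common(3))
```

Because the counts are summed before dividing, the per-cluster term frequency here is the share of all tokens in the cluster, which is not the same as averaging per-document frequencies.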
But I also agree with Martin that this is not the most intuitive way of trying to understand your clusters. You can use some of the methods he describes, or you can also just look at the centroid values directly (one of the outputs of the cluster operators) and find the values that are most distinct from one cluster to another (the graph visualization is helpful for this).
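The centroid comparison can also be sketched in plain Python, assuming the centroid table has been exported as one row of values per cluster with a shared term order. The helper `distinctive_terms`, the scoring rule (centroid value minus the mean over the other clusters), and the numbers are assumptions for illustration, not something the cluster operators provide.

```python
def distinctive_terms(centroids, terms, top_n=3):
    """Rank terms per cluster by centroid value minus the mean of the other clusters.

    Assumes at least two clusters and that every centroid row follows `terms` order.
    """
    result = {}
    for i, row in enumerate(centroids):
        others = [c for j, c in enumerate(centroids) if j != i]
        diffs = []
        for k, term in enumerate(terms):
            other_mean = sum(c[k] for c in others) / len(others)
            diffs.append((row[k] - other_mean, term))
        diffs.sort(reverse=True)  # largest positive difference first
        result[i] = [term for _, term in diffs[:top_n]]
    return result

# Hypothetical centroid values for two clusters over four terms.
terms = ["bank", "cup", "rates", "team"]
centroids = [
    [0.40, 0.00, 0.60, 0.05],
    [0.02, 0.50, 0.01, 0.60],
]

top = distinctive_terms(centroids, terms, top_n=2)
for cluster, words in top.items():
    print(cluster, words)
```

Terms with the largest positive difference are the ones that most separate a cluster from the rest, which is roughly what the graph visualization lets you spot by eye.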