What is a good threshold for CosineSimilarity Measure?
mrc
New Altair Community Member
Hi RM community,
I'm using the Cosine Similarity measure in the Cross Distance operator to determine the relevance of documents in a corpus of 5000 documents to a reference document. I'm getting results ranging from 0.8 to 1.6, without any significant breakpoint between relevant and not-so-relevant documents. How can I determine a threshold that is mathematically sound so that I know that documents below the threshold can be categorized as relevant and the ones above as not relevant? In short, how does one determine a threshold for cosine similarity measures with the cross distances operator?
Thanks so much, any insights will be greatly appreciated as I'm very new to this!
Marcia
Answers
I do not have an answer yet, but since posting this I've used the Normalize operator to normalize the results between 0 and 1. I am now trying to decide what threshold makes sense, leaning towards 0.25 or 0.5. I'd like to justify my threshold choice with a mathematically sound answer, but so far I have not come across one. Any insights to help? Thanks much!
Hi @mrc, thanks for sharing your findings.
When I use the Cross Distance operator with cosine similarity on text documents, the cosine similarities usually range from 0 to 1.
Just remember to enable "compute similarities" for the cosine measure.
If I compute distances instead of similarities, the results can fall outside the [0,1] range. The higher the similarity, the lower the distance.
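To make the similarity/distance difference concrete, here is a minimal Python sketch (outside RapidMiner, so not the operator's actual code). The vectors are made-up term counts, and both distance conversions are assumptions: `1 - similarity` is one common definition, and the angle `acos(similarity)` is another. For non-negative vectors the angle tops out at pi/2, about 1.571, which would be consistent with the 0.8 to 1.6 range reported in the question, but I have not confirmed that this is what the operator computes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch (empty document)
    return dot / (norm_a * norm_b)

def one_minus_distance(a, b):
    """Assumed distance definition #1: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

def angular_distance(a, b):
    """Assumed distance definition #2: the angle in radians."""
    # Clamp to [-1, 1] so rounding error cannot break acos.
    sim = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(sim)

# Hypothetical term-count vectors (non-negative, as with TF or TF-IDF).
reference = [2, 1, 0, 1]
same_topic = [4, 2, 0, 2]   # same direction as the reference
unrelated = [0, 0, 3, 0]    # shares no terms with the reference

sim_same = cosine_similarity(reference, same_topic)
sim_unrelated = cosine_similarity(reference, unrelated)
angle_unrelated = angular_distance(reference, unrelated)

print(sim_same, sim_unrelated, angle_unrelated)
```

With non-negative vectors the similarity stays in [0,1], while the angular version of the distance can exceed 1, so a threshold only makes sense once you know which of the two quantities you are looking at.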
When you pick a similarity threshold for text documents, a value higher than 0.5 usually indicates strong similarity. The distribution in the histogram may look different for another use case, so always check the histogram before you pick the threshold.
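As a rough illustration of the histogram check, here is a small Python sketch. The scores are invented for the example, and the helper names `histogram` and `relevant` are hypothetical; in practice you would export the similarity column from your own process.

```python
def histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    return counts

def relevant(scores, threshold):
    """Indices of documents whose similarity is at or above the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical similarities of 10 documents to the reference document.
scores = [0.05, 0.08, 0.12, 0.15, 0.18, 0.22, 0.31, 0.62, 0.71, 0.85]

counts = histogram(scores)
for i, c in enumerate(counts):
    # Text histogram: one '#' per document in the bin.
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * c}")

selected = relevant(scores, 0.5)
print(selected)
```

In this made-up data there is an empty gap between 0.4 and 0.6, so 0.5 separates the two groups cleanly. If your histogram shows no such gap, the cut-off is a judgment call, and it is worth reading a few documents on either side of it before committing.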
Cheers,
YY