Building Similarity Matrix
Hi all,
My problem is as follows: given two groups of documents, I want to compute cosine similarity and output a similarity matrix with all possible comparisons. The matrix should be labeled with the names of the documents (not terms).
For the moment, the operator pipeline is:
Process documents from Files -> Data to Similarity
My questions:
1) Is it OK to use the Process Documents from Files operator and create 2 entries under text directories, one per group of documents to compare (so I will have 2 class names, i.e. 2 folders with different documents)?
2) Which operator lets me visualize a document similarity matrix?
Any advice is very much appreciated!
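For reference, the computation described above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the file names and texts are made up, and the vectors use raw term counts rather than whatever weighting (for example TF-IDF) the text processing operator is configured to produce.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector of raw term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def cross_similarity(group_a, group_b):
    """Compare every document in group_a with every document in group_b.

    Both arguments map document name -> text. The result is a nested dict
    keyed by document names: result[name_a][name_b] -> similarity.
    """
    vecs_a = {name: term_vector(text) for name, text in group_a.items()}
    vecs_b = {name: term_vector(text) for name, text in group_b.items()}
    return {
        name_a: {name_b: cosine(va, vb) for name_b, vb in vecs_b.items()}
        for name_a, va in vecs_a.items()
    }


# Hypothetical documents standing in for the two folders.
group_a = {"a1.txt": "the cat sat on the mat", "a2.txt": "dogs chase cats"}
group_b = {"b1.txt": "the cat sat on the mat", "b2.txt": "stock prices fell"}

matrix = cross_similarity(group_a, group_b)
for name_a, row in matrix.items():
    print(name_a, {name_b: round(sim, 3) for name_b, sim in row.items()})
```

Rows are documents from the first group and columns are documents from the second, so the matrix is keyed by document names rather than terms.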
Hello Andrew,
Thank you for your reply!
I asked the first question because I saw another way of processing documents, with a pipeline like this:
Loop Files - (Read Document as nested process) -> Process Documents
I was wondering whether there is any fundamental difference between that and just using the Process Documents from Files operator.
The Data to Similarity Data operator is indeed what I need, but I can't find information on how to replace the IDs with document names. Could you please provide any suggestions?
For the moment, I am using the Block Plotter with x-Axis: First_ID, y-Axis: Second_Id.
Maybe it's possible to replace them in the Data View table?
Thanks again,
I don't quite understand what you mean by your first question, but the Process Documents from Files operator does work like that.
For the second question, use the Data to Similarity Data operator and plot the result using the Block Plotter. You will have to do some lookup and replacement work to get the document names instead of IDs.
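To illustrate what the lookup and replacement amounts to, here is a minimal Python sketch, not a RapidMiner process. It assumes the similarity result has been exported as rows of (first ID, second ID, similarity) and that a separate ID-to-name table is available; the IDs, file names, and similarity values below are all made up.

```python
# Hypothetical export of the pairwise similarity table:
# one row per (first ID, second ID, similarity).
pairs = [(1, 2, 0.12), (1, 3, 0.80), (2, 3, 0.05)]

# Hypothetical lookup table from example ID to document name.
names = {1: "report_a.txt", 2: "report_b.txt", 3: "memo_c.txt"}


def rename_pairs(pairs, names):
    """Replace both ID columns with the corresponding document names."""
    return [(names[first], names[second], sim) for first, second, sim in pairs]


def to_matrix(named_pairs):
    """Pivot the long pair list into a symmetric name-by-name matrix."""
    matrix = {}
    for first, second, sim in named_pairs:
        matrix.setdefault(first, {})[second] = sim
        matrix.setdefault(second, {})[first] = sim
    return matrix


named = rename_pairs(pairs, names)
matrix = to_matrix(named)
print(named[1])
print(matrix["memo_c.txt"])
```

A missing ID raises a KeyError here, which makes gaps in the lookup table visible instead of silently leaving numeric IDs in the plot.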
regards
Andrew