Building Similarity Matrix

Question

Hi all,

My problem is as follows: given two groups of documents, I want to compute the cosine similarity for every pair and output a similarity matrix covering all possible comparisons. The matrix should be labeled with the names of the documents (not with terms).

For the moment, the operator pipeline is:

Process Documents from Files -> Data to Similarity

My questions:

1) Is it OK to use the Process Documents from Files operator and, under text directories, create two entries with the different documents to compare (so I will have two class names, i.e. two folders of documents to compare)?

2) Which operator can visualize a document similarity matrix?

Any advice is very much appreciated!

Andrew2 · Answer

Hello tutur

The Process Documents operators are basically the same.

You could use the Join operator to match the document ids in the similarity matrix with the ids in the original example set. You would need some renaming and Set Role steps first.

The Map operator might be easier.
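
Outside RapidMiner, the lookup idea behind Join or Map can be sketched in a few lines of Python. This is only an illustration: the column names follow the First_ID / Second_Id labels mentioned in this thread, while the similarity values, the file names, and the replace_ids_with_names helper are made up for the example.

```python
# Hypothetical long-format similarity table, one row per document pair.
# The values are invented purely for illustration.
pairs = [
    {"FIRST_ID": 1, "SECOND_ID": 2, "SIMILARITY": 0.82},
    {"FIRST_ID": 1, "SECOND_ID": 3, "SIMILARITY": 0.10},
    {"FIRST_ID": 2, "SECOND_ID": 3, "SIMILARITY": 0.35},
]

# Hypothetical lookup table from document id to file name.
id_to_name = {1: "report_a.txt", 2: "report_b.txt", 3: "memo_c.txt"}


def replace_ids_with_names(rows, lookup):
    """Return new rows whose id columns hold document names instead of ids."""
    named = []
    for row in rows:
        named.append({
            "FIRST_DOC": lookup[row["FIRST_ID"]],
            "SECOND_DOC": lookup[row["SECOND_ID"]],
            "SIMILARITY": row["SIMILARITY"],
        })
    return named


named_pairs = replace_ids_with_names(pairs, id_to_name)
for row in named_pairs:
    print(row["FIRST_DOC"], row["SECOND_DOC"], row["SIMILARITY"])
```

A Join does the same thing twice, once per id column, which is why the renaming steps are needed; a Map-style replacement corresponds to the dictionary lookup here.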

regards

Andrew

tutur · Answer

Hello Andrew,

Thank you for the reply!

I asked the first question because I saw another way of processing documents; the pipeline looks like this:

Loop Files (with Read Document as a nested process) -> Process Documents

I was wondering whether there is any fundamental difference from just using the Process Documents from Files operator.

The Data to Similarity Data operator is indeed what I need, but I can't find any information on how to replace the ids with document names. Could you please suggest something?
For the moment I am using the Block Plotter with x-axis = First_ID and y-axis = Second_Id.
Maybe it's possible to replace them in the Data View table?

Thanks again,

Andrew2 · Answer

hello tutur

I don't quite understand what you mean by your first question, but the Process Documents from Files operator does work like that.

For the second question, use the Data to Similarity Data operator and plot the result using the Block Plotter. You will have to do some lookup and replacing work to get the document names instead of an id.
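
To make the target concrete, here is a rough sketch in plain Python of the kind of result being asked for: a cosine similarity matrix between two groups of documents, with document names as row and column labels. It does not use RapidMiner at all. The file names, the texts, the whitespace tokenizer, and the raw term-count weighting are all assumptions chosen for the example; they are not how the Process Documents operators necessarily build their vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercase, split on whitespace, and count terms (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(group_a, group_b):
    """Compare every document in group_a with every document in group_b.

    Both arguments map a document name to its text. The result is a nested
    dict: matrix[name_in_a][name_in_b] -> cosine similarity.
    """
    vecs_a = {name: term_vector(text) for name, text in group_a.items()}
    vecs_b = {name: term_vector(text) for name, text in group_b.items()}
    return {
        name_a: {name_b: cosine(va, vb) for name_b, vb in vecs_b.items()}
        for name_a, va in vecs_a.items()
    }


# Made-up documents standing in for the two folders.
group_a = {"a1.txt": "the cat sat on the mat", "a2.txt": "stock prices fell sharply"}
group_b = {"b1.txt": "the cat sat on the mat", "b2.txt": "the dog chased the ball"}

matrix = similarity_matrix(group_a, group_b)

# Print the matrix with document names as row and column labels.
print("\t".join([""] + list(group_b)))
for name_a, row in matrix.items():
    print("\t".join([name_a] + [f"{row[name_b]:.2f}" for name_b in group_b]))
```

Keeping the names as dictionary keys from the start avoids the id-to-name lookup altogether; in RapidMiner the ids are generated by the operator, which is why the extra lookup step is needed there.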

regards

Andrew