[SOLVED] X-Distances Operator does not show the most similar document

Question

Hey everybody, I hope you can help me a second time: I thought I had solved a tricky problem, but it turns out not to be so easy. I want to compare an input document to the documents of a collection to find the most similar document in the collection.

--> Via k-means clustering I divide the collection documents into groups and compare the clusters (centroid vectors) with the input document to find the most similar cluster.
--> Via "join" I extract the documents from the most similar cluster.
--> Via "Cross Distances" with "only top k" I should then find the document of the collection that is most similar to the input document (input = req, documents = ref).

At first it all seemed to work, but then I found that "Cross Distances" does not show the most similar documents, but the documents with the smallest id (which are the first entries in the cluster). I have no idea why this operator is not working properly. The columns of req and ref are the same, except for the regular ones, which are the vector entries and therefore differ in number and weight. For processing both the document collection and the input document I use the same inner process.

I really hope someone has a good idea, because it is very urgent for me to solve this problem. All the best!!

Note: the "read" operator is the output of the "create document-process documents..." part. I included it just so you can see how I process the input document.

tiramisusann · Answer

YES! It works!  ;D
Thank you so much for your help!!

All the best

tiramisusann

Andrew2 · Answer

Hello tiramisusann,

That is the problem: you need to make sure the example sets are precisely the same in terms of the number, names and types of attributes. The way to do this is to use the word list feature of the process document operators. Here is an example I made; notice the word list being connected from the output of the first process document operator to the input of the second.

Regards,
Andrew
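For readers outside RapidMiner, the word list idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the RapidMiner process from the thread: the tokenizer, the function names (`build_word_list`, `vectorize`, `top_k_similar`) and the sample texts are all hypothetical, and plain term counts with cosine similarity stand in for whatever vector creation and distance measure the real process uses. The point it shows is that the word list is built once from the collection and then reused for the input document, so both vectors have exactly the same attributes in the same order.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_word_list(documents):
    # The word list comes from the collection only, in a fixed (sorted) order.
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text, word_list):
    # One entry per word-list term, always in the same order.
    # Words that are not in the word list are dropped.
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in word_list]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(request, collection, k=1):
    # Reuse the collection's word list for the request document.
    word_list = build_word_list(collection)
    request_vector = vectorize(request, word_list)
    scored = [
        (doc_id, cosine_similarity(request_vector, vectorize(doc, word_list)))
        for doc_id, doc in enumerate(collection)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up sample data, for illustration only.
collection = [
    "apples and oranges are fruit",
    "cars and trucks are vehicles",
    "clustering groups similar documents",
]
request = "which documents are similar"

best = top_k_similar(request, collection, k=1)
print(best)  # the third document (index 2) ranks first
```

Because the request is vectorized against the collection's word list, a word such as "which" that never occurs in the collection simply produces no attribute, and the two example sets stay aligned.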

tiramisusann · Answer

Hey Andrew,

Well, the input document has the following attributes:
role: text ; name: text ; type: text
role: id ; name: id ; type: integer
and then the vector entries, which all have the role "regular" and the type "real".

The document collection has the following attributes:
role: text ; name: text ; type: text
role: id ; name: id ; type: integer
and then the vector entries, which all have the role "regular" and the type "real".

What is indeed different is the number of regular attributes. Because the input document and the document collection are not processed together, they do not have the same number of regular attributes (vector entries). Do you think that this is actually the problem?

I added two pictures to show you the output ...
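The mismatch in regular attributes described above can be reproduced with a small Python sketch. It is an illustration only: the helper `word_list` and the sample texts are hypothetical, and a simple whitespace tokenizer is assumed. It shows that building the vocabulary separately for the collection and for the input document gives two different attribute sets, so the vector entries no longer line up.

```python
def word_list(documents):
    # Hypothetical helper: the set of distinct lowercase tokens, sorted.
    return sorted({word for doc in documents for word in doc.lower().split()})


# Made-up sample data, for illustration only.
collection = [
    "apples and oranges are fruit",
    "cars and trucks are vehicles",
    "clustering groups similar documents",
]
request = "which documents are similar"

# Processing each side on its own yields two independent attribute sets.
collection_attributes = word_list(collection)
request_attributes = word_list([request])

only_in_collection = sorted(set(collection_attributes) - set(request_attributes))
only_in_request = sorted(set(request_attributes) - set(collection_attributes))

print(len(collection_attributes), len(request_attributes))
print("only in collection:", only_in_collection)
print("only in request:", only_in_request)
```

A check like this, run on the attribute names of both example sets, makes it easy to see whether they differ before any distances are computed.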