"Finding the most similar document(s) in a collection to a test document"

Question

I have built an operator chain to compare a test document to a collection of documents in order to find the documents most similar to the test document. My original approach ran a similarity test across all documents (the collection plus the test document) and then filtered out just the results for the test document. This meant comparing the whole collection against itself, which was inefficient.

I have since tried the approach recommended at the end of this thread: http://rapid-i.com/rapidforum/index.php/topic,680.msg2587.html#msg2587. Unfortunately, I am afraid I have still produced a fairly inefficient solution. Could you look at the chain below and give me some advice on how to improve it?

A couple of considerations:
* I run the text input processing on both the collection and the test document so that I have a consistent vocabulary for the similarity processing.
* I get a text file in the log but would prefer Excel or CSV output. Perhaps I can do this in the ProcessLog with some constants for quotation marks and commas.

Here is my chain:

Thank you very much in advance for your thoughts and recommendations.

Charles

land · Answer

Hi Charles,
although you can do nearly everything with a well-chosen set of operators, such solutions are mostly far from efficient. If you want an efficient solution, our InformationRetrieval plugin might be worth a try. Unfortunately it's not finished yet, but it does already include an operator that calculates the distance from each example of a first example set to each example of a second. It then returns the k nearest examples and their distances.
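
To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of comparing each test document only against the collection and keeping the k best matches. It is not the plugin's implementation: the function names, the whitespace tokenization, and the use of cosine similarity on raw term counts (instead of a distance measure) are all assumptions chosen for illustration.

```python
import heapq
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumption: simple lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_most_similar(test_docs, collection, k):
    """For each test document, return the k most similar collection documents.

    Both arguments map a document id to its text. Each test document is
    compared against the collection only, never the collection against itself.
    """
    # Vectorize the collection once and reuse it for every test document.
    collection_vectors = {
        doc_id: term_vector(text) for doc_id, text in collection.items()
    }
    results = {}
    for test_id, text in test_docs.items():
        test_vector = term_vector(text)
        scored = (
            (cosine_similarity(test_vector, vector), doc_id)
            for doc_id, vector in collection_vectors.items()
        )
        top = heapq.nlargest(k, scored)
        results[test_id] = [(doc_id, score) for score, doc_id in top]
    return results
```

With one test document and n collection documents this performs n comparisons, rather than the roughly n²/2 needed to compare the whole collection against itself.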

If you are interested, we might think about offering a reduced-price pre-release version.
If you want to write the similarities into an Excel file, just apply Similarity2ExampleSet and write the resulting example set to an Excel file using the ExcelExampleSetWriter.
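
If CSV output is enough, the same kind of result table can also be produced outside of RapidMiner. The following Python sketch only illustrates the target format and does not use the operators named above; the function name and the column headers are hypothetical.

```python
import csv
import io


def write_similarities_csv(similarities, out):
    """Write (test_id, doc_id, score) rows as CSV to an open text stream.

    Text fields are quoted and numbers are left bare, so the quotation marks
    and commas do not have to be assembled by hand.
    """
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC)
    # Column names are illustrative only.
    writer.writerow(["test_document", "collection_document", "similarity"])
    for test_id, doc_id, score in similarities:
        writer.writerow([test_id, doc_id, score])
```

To write a file that Excel can open, pass a stream opened with `open("similarities.csv", "w", newline="")`, where the file name is again just an example.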

Greetings,
  Sebastian