Hi,
I started using RapidMiner a few weeks ago in order to complete my master's thesis. Now I have a big problem, and I hope someone can help me.
First of all, the goal:
I am trying to solve the problem of retrieving documents that are similar to an input document. The goal is to return one document from a collection that closely matches the input document.
My Solution: I create word vectors for both the documents in the collection and the input document. Then I cluster the collection documents with k-means to obtain clusters and their centroids. To find documents that match the input document, I want to compare the centroid vectors with the input document vector. After that, I only want to consider the documents that belong to the cluster with the most similar centroid vector. From that small selection I then want to determine the most similar document.
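To make the intended pipeline concrete, here is a minimal sketch in plain Python (not RapidMiner). Everything in it is an assumption for illustration only: the function names, the toy documents, plain term counts as word vectors, k-means initialized with the first k documents, and cosine similarity for comparing the input vector with centroids and documents.

```python
import math
from collections import Counter

def vectorize(text, vocab):
    # Term-count vector over a fixed vocabulary; unknown words are ignored.
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sq_dist(a, b):
    # Squared Euclidean distance, used for the k-means assignment step.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    # Assumption: deterministic init with the first k vectors as centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def retrieve(query, docs, k=2):
    # Build the vocabulary from the collection only.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    doc_vecs = [vectorize(d, vocab) for d in docs]
    q = vectorize(query, vocab)
    centroids, labels = kmeans(doc_vecs, k)
    # Step 1: find the cluster whose centroid is most similar to the input.
    best_cluster = max(range(k), key=lambda c: cosine(q, centroids[c]))
    # Step 2: search only the documents inside that cluster.
    candidates = [i for i, l in enumerate(labels) if l == best_cluster]
    best_doc = max(candidates, key=lambda i: cosine(q, doc_vecs[i]))
    return best_cluster, best_doc

# Toy collection (hypothetical data).
docs = [
    "apple banana fruit",
    "stock market price",
    "banana fruit salad",
    "market price crash",
]
cluster, doc_index = retrieve("fruit salad with banana", docs)
print(cluster, docs[doc_index])
```

So the two outputs I am looking for are the index of the most similar cluster and the index of the most similar document inside it.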
My Problem: I am trying to use the k-Nearest Neighbor (k-NN) algorithm to compare the input vector with the centroid vectors from the clustering of the collection. But I don't know how to implement that properly in RapidMiner.
- How can I use only the centroid vectors as input to k-NN?
- Is there a way to get the most similar cluster as output?
- Is there a way to get the most similar document as output?
A picture of the current process:
http://s7.directupload.net/file/d/3404/cuvjjqez_png.htm
I really hope someone can help.
All the best!