"[SOLVED] k-Nearest Neighbor for a clustered search"

tiramisusann
tiramisusann New Altair Community Member
edited November 5 in Community Q&A
Hi,

I started using RapidMiner a few weeks ago in order to complete my master's thesis. Now I have a big problem, and I hope someone can help me.

First of all, the goal:
I am trying to solve the problem of retrieving documents that are similar to an input document. The goal is to return one document from a collection that closely matches the input document.

My Solution: I am creating word vectors for both the documents in the collection and the input document. Afterwards I cluster the collection documents via k-means in order to obtain clusters and their centroids. To find documents that match the input document, I want to compare the centroid vectors with the input document vector. Furthermore, I only want to consider the documents that belong to the cluster with the most similar centroid vector. Then I want to determine the most similar document from that small selection.

My Problem: I am trying to use the k-Nearest Neighbor algorithm to compare the input vector with the centroid vectors of the clustered collection, but I don't know how to implement that properly in RapidMiner.
- How can I use only the centroid vectors as input for k-NN?
- Is there any way to get the most similar cluster as output?
- Is there any way to get the most similar document as output?

A picture from the current process:
http://s7.directupload.net/file/d/3404/cuvjjqez_png.htm

I really hope someone can help.
All the best!


Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Hey,

    This is quite a complex problem. To solve it, you need several steps:

    1. k-Means delivers a cluster model. Apply it to your input document to find out which cluster it belongs to.
    2. Join the input document with the complete data set via the cluster attribute (the input document must be connected to the left port). That will keep only those documents from the complete collection that are in the same cluster.
    3. Use Cross-Distances to calculate the distances of the input document to the result of (2). Use only_top_k to retrieve the id of the closest document. The input document must be connected to the req port.
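
For anyone who wants to see the logic of the three steps above outside RapidMiner, here is a minimal Python sketch. It assumes Euclidean distance, and the toy vectors, cluster labels, centroids and function names are all made up for illustration; they are not RapidMiner operators or real output.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(vector, centroids):
    """Step 1: index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))

def most_similar_document(query, documents, labels, centroids, top_k=1):
    """Steps 2 and 3: keep the documents in the query's cluster, then rank them."""
    cluster = nearest_centroid(query, centroids)
    # Step 2: only documents with the same cluster label survive (the "join").
    candidates = [(doc_id, vec) for doc_id, vec in documents.items()
                  if labels[doc_id] == cluster]
    # Step 3: distances from the query to each remaining document, closest first.
    ranked = sorted(candidates, key=lambda item: euclidean(query, item[1]))
    return cluster, [(doc_id, euclidean(query, vec)) for doc_id, vec in ranked[:top_k]]

# Hypothetical toy data: four documents, already clustered into two groups.
documents = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 1.0],
    "doc_d": [0.0, 0.8, 1.2],
}
labels = {"doc_a": 0, "doc_b": 0, "doc_c": 1, "doc_d": 1}
centroids = [[0.95, 0.05, 0.0], [0.0, 0.9, 1.1]]

query = [0.0, 0.9, 1.0]
cluster, matches = most_similar_document(query, documents, labels, centroids)
print(cluster, matches)
```

With this toy data the query lands in cluster 1 and doc_c comes out as the closest document. Raising top_k returns more candidates from the same cluster.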

    Good luck with your thesis!
    ~Marius
  • tiramisusann
    tiramisusann New Altair Community Member
    Marius, you are my hero ;). Thank you very, very much! It worked, and I am now able to identify the most similar document in my collection.

    The only thing that doesn't work yet is the computation of the similarity (or distance) by the Cross-Distances operator. It shows only '?' in the 'distance' field. I believe the problem is that the vectors are not identical concerning the number of rows. But if I process the input document separately from the collection documents (which I need to do), then it is not possible to have the same number of rows in the vectors.

    Am I right? Or is there a way to get a numerical value for the similarity after all?

  • MariusHelf
    MariusHelf New Altair Community Member
    The number of rows does not matter, but you should have the same columns in both datasets.
    If you are using different document collections with different Process Documents operators it is crucial to:
    1. perform the same steps in both operators (same inner process)
    2. connect the wor output of the first operator to the wor input of the second one.
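
To illustrate why the shared word list matters, here is a small Python sketch of the idea, not of RapidMiner itself. The function names and the sample texts are hypothetical. The word list is built once from the collection and then reused for the input document, so both sides end up with exactly the same columns.

```python
from collections import Counter

def tokenize(text):
    """The same preprocessing must run for the collection and the input document."""
    return text.lower().split()

def build_word_list(texts):
    """Sorted vocabulary of the collection; stands in for the word list."""
    return sorted({token for text in texts for token in tokenize(text)})

def vectorize(text, word_list):
    """Term counts in word_list order; tokens outside the word list are ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]

# Hypothetical collection and input document.
collection = ["red apple pie", "green apple tart"]
word_list = build_word_list(collection)

collection_vectors = [vectorize(text, word_list) for text in collection]
query_vector = vectorize("Red apple crumble", word_list)

print(word_list)
print(query_vector)
```

The word "crumble" does not occur in the collection, so it is simply dropped, and the query vector has the same length and column order as the collection vectors. If the input document built its own vocabulary instead, the columns would no longer line up and a distance between the two could not be computed meaningfully.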

    Best regards,
    Marius