"Matching Text with n-grams"
Hello,
I am still fairly new to RapidMiner. From what I have seen so far, it is clearly a powerful tool and the result of massive brainpower!
I have two questions and hope some of you can help me.
The situation:
I have a master list of product descriptions (big) and have to find similar entries in other lists (small-medium size). It is a 1:n matching task.
I am using basic operators to get rid of unwanted text (stemming, stop word removal, HTML stripping, ...) and can generate n-grams. I do this twice, once for the master list and once for a specific description, then combine the two results and sort them.
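To make the setup concrete, here is a rough sketch in plain Python (not RapidMiner) of what my process does for a single description. Everything in it is an assumption for illustration: the names `word_ngrams`, `jaccard` and `rank_matches`, the tiny stop word list, and the choice of Jaccard overlap as the similarity score. Stemming is left out to keep it short.

```python
import re

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "a", "an", "with", "and", "for", "of"}

def word_ngrams(text, n=2):
    """Strip HTML tags, lowercase, drop stop words, return the set of word n-grams."""
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOP_WORDS]
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Share of n-grams that two descriptions have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_matches(description, master_list, n=2):
    """Score one description against every master entry, best match first."""
    query = word_ngrams(description, n)
    scored = [(jaccard(query, word_ngrams(entry, n)), entry)
              for entry in master_list]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Note that `rank_matches` regenerates the n-grams of every master entry on each call, which is exactly the redundancy I would like to avoid.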
First, the practical question:
Each description has to be matched against all entries of the master list (there is potential for optimization, but the dataset is too small to justify this in the first step). Is there an operator to avoid redundant n-gram generation? What would be the best way to match, say, 5 lists of descriptions against the master list without redundant work?
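What I have in mind is roughly the following, again sketched in plain Python rather than as a RapidMiner process. The idea is to generate the master list's n-grams once, keep them, and reuse them for every other list. The helper names (`build_index`, `best_match`, `match_lists`) and the Jaccard score are my own assumptions, not existing operators.

```python
import re

def word_ngrams(text, n=2):
    """Lowercase, tokenise, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))

def build_index(master_list, n=2):
    """Generate the n-grams of every master entry exactly once."""
    return [(entry, word_ngrams(entry, n)) for entry in master_list]

def best_match(description, index, n=2):
    """Compare one description with the prebuilt index; return (score, entry)."""
    query = word_ngrams(description, n)
    best = (0.0, None)
    for entry, grams in index:
        union = query | grams
        score = len(query & grams) / len(union) if union else 0.0
        if score > best[0]:
            best = (score, entry)
    return best

def match_lists(lists, master_list, n=2):
    """Match several named lists against the master list, reusing one index."""
    index = build_index(master_list, n)  # built once, shared by all lists
    return {name: [best_match(d, index, n) for d in descriptions]
            for name, descriptions in lists.items()}
```

With 5 lists, `build_index` still runs only once. Only the descriptions in the smaller lists get their n-grams generated per comparison run.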
And second, the theoretical question:
Can you think of a setup where I treat all lists as equal? Basically a cloud of descriptions in which I group the most similar ones together.
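One way I could imagine this, as a sketch only: pool the descriptions from all lists, compare every pair, and link any two whose similarity reaches a threshold, so that linked descriptions end up in the same group. The plain Python below does this with a small union-find structure. The function name `group_similar`, the threshold of 0.5, and the Jaccard score are assumptions for illustration, and the all-pairs comparison grows quadratically, so it would only suit a modest number of descriptions.

```python
import re
from itertools import combinations

def word_ngrams(text, n=2):
    """Lowercase, tokenise, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Share of n-grams that two descriptions have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_similar(descriptions, threshold=0.5, n=2):
    """Pool all descriptions and link every pair at or above the threshold."""
    grams = [word_ngrams(d, n) for d in descriptions]
    parent = list(range(len(descriptions)))  # union-find forest

    def find(i):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(descriptions)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, d in enumerate(descriptions):
        groups.setdefault(find(i), []).append(d)
    return list(groups.values())
```

Because links are transitive here, two descriptions can land in the same group through a chain of intermediate matches even if they are not directly similar, which may or may not be what is wanted.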
If you could spare some time to assist me, I would be very grateful.
Thank you