"Matching Text with n-grams"

Agathon · January 2011

Hello,

I am still kinda new to RapidMiner. From what I have seen so far, it is clearly a powerful result of massive brainpower!

I have 2 questions and hope some of you can help me.
The situation:
I have a (big) master list of product descriptions and have to find similar entries in other lists (small to medium size). It is a 1:n matching task.
I am using basic operators to get rid of unwanted text (stemming, stop words, HTML, ...) and can generate n-grams. I do this twice, once for the master list and once for a specific description, then combine the results and sort them.

First, the practical question:
Each description has to be matched against all entries of the master list (there is potential for optimization, but the dataset is too small to warrant it in the first step). Is there an operator to avoid redundant n-gram generation? What would be the best way to match, say, 5 lists of descriptions against the master list without redundant work?

And second, the theoretical question:
Can you think of a setup where I treat all lists as equal? Basically a cloud of descriptions, where I aggregate the most similar ones?

If you could spare some time to assist me, I would be very grateful.

Thank you

IngoRM · January 2011

Hi Agathon,

I am still kinda new to RapidMiner. From what I have seen so far, it is clearly a powerful result of massive brainpower!

Thanks. We always like to hear if somebody appreciates that ;D

Is there an operator to avoid redundant n-gram generation? What would be the best way to match, say, 5 lists of descriptions against the master list without redundant work?

Well, there is no non-redundant n-gram generation directly. Maybe you could remove the redundant ones afterwards. However, I am not sure I have understood your task correctly, but maybe it would be possible to calculate not only the n-grams but also a vectorized representation using, for example, TF-IDF. In that case the redundant terms would no longer occur, and you could calculate a similarity instead, which would also deliver "fuzzy" matches (which could be disregarded if only perfect matches are of interest). But as I said, I am not 100% sure I understood you correctly...
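To make the TF-IDF idea concrete outside of RapidMiner, here is a minimal sketch in plain Python. It is an illustration only, not a RapidMiner process: the function names, the choice of character trigrams, the smoothed IDF formula, and the second master entry ("Apple iPad 16GB white") are all assumptions made for the example.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Return one sparse TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    return [
        # Smoothed IDF (an assumption; other weightings work as well).
        {g: tf * (math.log((1 + total) / (1 + df[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical data: the first entry is from the thread, the second is made up.
master = ["Apple iPhone 4 32GB black", "Apple iPad 16GB white"]
query = "iphone 4 32GB b."

vecs = tfidf_vectors(master + [query])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
best = max(range(len(master)), key=scores.__getitem__)
print(master[best], round(scores[best], 3))
```

Because each description becomes a single weighted vector, repeated n-grams only change a weight instead of producing duplicate rows, and a low similarity score can be used to reject entries that have no real match.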

Can you think of a setup where I treat all lists as equal? Basically a cloud of descriptions, where I aggregate the most similar ones?

Again, I am not 100% sure, but this would be pretty much the approach I suggested above, right?
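One way to picture the "cloud of descriptions" setup is to pool all lists and group entries whose n-gram sets overlap strongly. The sketch below is plain Python, not a RapidMiner process; the Jaccard measure, the single-link grouping via union-find, the threshold of 0.4, and the iPad entry are assumptions chosen for illustration.

```python
from itertools import combinations

def ngram_set(text, n=3):
    """Set of character n-grams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Share of n-grams the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_similar(descriptions, threshold=0.4, n=3):
    """Single-link grouping: two entries end up together if a chain of
    pairwise similarities >= threshold connects them."""
    grams = [ngram_set(d, n) for d in descriptions]
    parent = list(range(len(descriptions)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(descriptions)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, d in enumerate(descriptions):
        groups.setdefault(find(i), []).append(d)
    return list(groups.values())

# All lists pooled together; the last entry is a made-up non-matching product.
pool = [
    "Apple iPhone 4 32GB black",
    "iphone 4 32GB b.",
    "apple iphone 32GB black",
    "Apple iPad 16GB white",
]
clusters = group_similar(pool)
for c in clusters:
    print(c)
```

Note that this compares every pair of descriptions, so the cost grows quadratically with the pool size; for large pools some form of candidate pre-selection would be needed.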

Cheers,
Ingo

Agathon · January 2011

Hi Ingo,
Thank you for your quick reply :)

Ingo Mierswa wrote:

Well, there is no non-redundant n-gram generation directly. Maybe you could remove the redundant ones afterwards. However, I am not sure I have understood your task correctly, but maybe it would be possible to calculate not only the n-grams but also a vectorized representation using, for example, TF-IDF. In that case the redundant terms would no longer occur, and you could calculate a similarity instead, which would also deliver "fuzzy" matches (which could be disregarded if only perfect matches are of interest). But as I said, I am not 100% sure I understood you correctly...

Let me add some more details:
On one side you have a complete list of the products you are interested in; this list includes detailed descriptions and information about these products.
On the other side you have all kinds of lists (in my case 5) with incomplete information, missing IDs, cut-off descriptions or slightly different wording.
The task is to match the 5 lists against the complete master list. Not every product from the 5 lists must have a match in the master list, and you can assume that at most one entry from the master list matches.
Example:
Master list:
Apple iPhone 4 32GB black
Match:
iphone 4 32GB b.
iphone 32GB black
apple iphone 32GB black
etc.

What I can do so far is:
Select one product from the 5 lists, generate n-grams of the selection and of the master list, then match them.
But how is this done for all entries...? Just looping looks kinda sequential...

IngoRM · January 2011

Hi again,

well, looping would indeed be an option. The other would be to first transform all the matching-list data into one vectorized example set, which is then matched via similarity against the master list. If you have detailed questions about how this can be achieved with RapidMiner, I would suggest posting the processes you already have, together with some detailed questions, in the board "Data Mining / ETL / BI Processes". It's more likely that somebody there can help you with those details.
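As a rough illustration of the second option (again plain Python rather than a RapidMiner process), the master list's n-grams can be computed once and stored in an inverted index, so that every entry of every other list only needs its own n-grams generated. The helper names, the Jaccard score, the threshold of 0.3, and all sample data except the iPhone descriptions are assumptions for this sketch.

```python
from collections import defaultdict

def ngram_set(text, n=3):
    """Set of character n-grams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(master, n=3):
    """Compute the n-grams of each master entry once and index them."""
    grams = [ngram_set(m, n) for m in master]
    index = defaultdict(set)  # n-gram -> positions of master entries containing it
    for i, gs in enumerate(grams):
        for g in gs:
            index[g].add(i)
    return grams, index

def best_match(query, grams, index, n=3, threshold=0.3):
    """Return (master_position, jaccard_score), or None if nothing is similar enough."""
    q = ngram_set(query, n)
    # Only master entries sharing at least one n-gram are candidates.
    candidates = set()
    for g in q:
        candidates |= index.get(g, set())
    best = None
    for i in candidates:
        score = len(q & grams[i]) / len(q | grams[i])
        if best is None or score > best[1]:
            best = (i, score)
    return best if best and best[1] >= threshold else None

# Hypothetical data; only the iPhone descriptions come from the thread.
master = ["Apple iPhone 4 32GB black", "Apple iPad 16GB white"]
lists = {
    "list_a": ["iphone 4 32GB b.", "garden hose 20m"],
    "list_b": ["apple iphone 32GB black"],
}

grams, index = build_index(master)  # done once, reused for all lists
matches = {
    name: [best_match(d, grams, index) for d in descs]
    for name, descs in lists.items()
}
print(matches)
```

The threshold reflects the requirement that not every entry has a counterpart in the master list: entries scoring below it are reported as unmatched instead of being forced onto the nearest product.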

Cheers,
Ingo