I have two excel i want to find if row number 1 from one excel is present in my second exel or not.

I have two excel i want to find if row number 1 from first excel is present in my second excel or not. If not completely then by what percent it is matching . some thing like fuzzy matching in python .

Its is text "outlet name can be name of any outlet , address can be any address city state and zip . how we can see if that row is present in other excel or not . if not completely then by what percent it is matching

Image: https://us.v-cdn.net/6030995/uploads/editor/vo/032b5j1qfdc0.jpg

Image: https://us.v-cdn.net/6030995/uploads/editor/09/bafsvg0zn1n5.jpg

Find more posts tagged with

AI Studio

Excel

Accepted answers

BalazsBaranyRM

Hi @mschmitz,

the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1.

In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.

Regards,
Balázs

All comments

sgenzer

hi @pallav if you want a clear "is the row in the other sheet or not", I would simply do an inner join on the two tables on all the attributes and see if there are any examples coming out of the join. If you want a more fuzzy matching solution, I would concatenate all the text attributes into one attribute (using a space delimiter probably) and train a model to predict match/no match after text mining word vectorization.

MartinLiebig

Hi,

the other way would be to append the 2 data sets and use Remove Duplicates afterwards. Rows which are in both data sets are returned at the dup port.

Best,

Martin

YYH

Hi @pallav,

How about the similarity analytics? You can try the "cross distances" on separate data sets or "data to similarity" if you append two tables together. For strings like address/city/state/outlet name or zipcode, you may want to normalize all upper case letters to lower case and then apply "nominal to numerical" before calculating the cross-distance between the examples. Of course, we can do the nominal measurements. But you have more various formulas with numerical measurement to quantitatively define the difference.

operator doc
https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
previous discussions/knowledge base
https://community.rapidminer.com/discussion/53950/cross-distance-how-is-it-calculated
https://community.rapidminer.com/discussion/53715/two-documents-similarity-using-cross-distance

HTH!
YY

pallav

@yyhuang @mschmitz @sgenzer - The problem is the text might not match 100% in that case if i will do cross join remove duplicate and other technique it may not match .. In that case we have to do somethiing like how we do in python . Either tokenizing the word and later taking tfid vector of each word find the cosine similarity with other excel after doing same preprocessing , This we can do in python but how can we do same in rapidminer .

MartinLiebig

Hi @pallav,

i think we got three options here:
1. Use Process Documents to get TF/IDF, Cross Distances to get the cosine similarity, filter and join the original data.
2. Cross-Join the data, use Generate Levenshtein Distance
3. Use the DataBase Envy extension to do a non equal join. (I am not sure if this supports complex joins on fuzzy stuff). @BalazsBarany can you give some feedback on this?

Since you are also a customer, i would propose we do a quick call to walk you through? What times work better, European or East Coast work hours?

Best,
Martin

pallav

@mschmitz -I am good with European timing. We can arrange call now.

BalazsBaranyRM