I have two excel i want to find if row number 1 from one excel is present in my second exel or not.
pallav
New Altair Community Member
I have two excel i want to find if row number 1 from first excel is present in my second excel or not. If not completely then by what percent it is matching . some thing like fuzzy matching in python .
Its is text "outlet name can be name of any outlet , address can be any address city state and zip . how we can see if that row is present in other excel or not . if not completely then by what percent it is matching
Its is text "outlet name can be name of any outlet , address can be any address city state and zip . how we can see if that row is present in other excel or not . if not completely then by what percent it is matching
1
Best Answer
-
Hi @mschmitz,
the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1.
In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.
Regards,
Balázs3
Answers
-
hi @pallav if you want a clear "is the row in the other sheet or not", I would simply do an inner join on the two tables on all the attributes and see if there are any examples coming out of the join. If you want a more fuzzy matching solution, I would concatenate all the text attributes into one attribute (using a space delimiter probably) and train a model to predict match/no match after text mining word vectorization.2
-
Hi,the other way would be to append the 2 data sets and use Remove Duplicates afterwards. Rows which are in both data sets are returned at the dup port.Best,Martin3
-
Hi @pallav,
How about the similarity analytics? You can try the "cross distances" on separate data sets or "data to similarity" if you append two tables together. For strings like address/city/state/outlet name or zipcode, you may want to normalize all upper case letters to lower case and then apply "nominal to numerical" before calculating the cross-distance between the examples. Of course, we can do the nominal measurements. But you have more various formulas with numerical measurement to quantitatively define the difference.
operator doc
https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
previous discussions/knowledge base
https://community.rapidminer.com/discussion/53950/cross-distance-how-is-it-calculated
https://community.rapidminer.com/discussion/53715/two-documents-similarity-using-cross-distance
HTH!
YY1 -
@yyhuang @mschmitz @sgenzer - The problem is the text might not match 100% in that case if i will do cross join remove duplicate and other technique it may not match .. In that case we have to do somethiing like how we do in python . Either tokenizing the word and later taking tfid vector of each word find the cosine similarity with other excel after doing same preprocessing , This we can do in python but how can we do same in rapidminer .1
-
Hi @pallav,
i think we got three options here:
1. Use Process Documents to get TF/IDF, Cross Distances to get the cosine similarity, filter and join the original data.
2. Cross-Join the data, use Generate Levenshtein Distance
3. Use the DataBase Envy extension to do a non equal join. (I am not sure if this supports complex joins on fuzzy stuff). @BalazsBarany can you give some feedback on this?
Since you are also a customer, i would propose we do a quick call to walk you through? What times work better, European or East Coast work hours?
Best,
Martin2 -
Hi @mschmitz,
the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1.
In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.
Regards,
Balázs3