I have two excel i want to find if row number 1 from one excel is present in my second exel or not.

pallav
pallav New Altair Community Member
edited November 2024 in Community Q&A
I have two excel i want to find if row number 1 from first excel is present in my second excel or not. If not completely then by what percent it is matching . some thing like fuzzy  matching in python .

Its is text "outlet name can be name of any outlet , address can be any address city state and zip .  how we can see if that row is present in other excel or not . if not completely then by what percent it is matching





Tagged:

Best Answer

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi @mschmitz,

    the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1

    In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.

    Regards,
    Balázs

Answers

  • sgenzer
    sgenzer
    Altair Employee
    hi @pallav if you want a clear "is the row in the other sheet or not", I would simply do an inner join on the two tables on all the attributes and see if there are any examples coming out of the join. If you want a more fuzzy matching solution, I would concatenate all the text attributes into one attribute (using a space delimiter probably) and train a model to predict match/no match after text mining word vectorization.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,
    the other way would be to append the 2 data sets and use Remove Duplicates afterwards. Rows which are in both data sets are returned at the dup port.
    Best,
    Martin
  • YYH
    YYH
    Altair Employee
    edited February 2020
    Hi @pallav,

    How about the similarity analytics? You can try the "cross distances" on separate data sets or "data to similarity" if you append two tables together. For strings like address/city/state/outlet name or zipcode, you may want to normalize all upper case letters to lower case and then apply "nominal to numerical" before calculating the cross-distance between the examples. Of course, we can do the nominal measurements. But you have more various formulas with numerical measurement to quantitatively define the difference.

    operator doc
    https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
    previous discussions/knowledge base 
    https://community.rapidminer.com/discussion/53950/cross-distance-how-is-it-calculated
    https://community.rapidminer.com/discussion/53715/two-documents-similarity-using-cross-distance

    HTH!
    YY
  • pallav
    pallav New Altair Community Member
    @yyhuang @mschmitz @sgenzer - The problem is the text might not match 100% in that case if i will do cross join remove  duplicate and other technique it may not match .. In that case we have to do somethiing like how we do in python . Either tokenizing  the word and later taking tfid vector of each word find the cosine similarity with other excel after doing same preprocessing , This we can do in python but how can we do same in rapidminer .
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi @pallav,

    i think we got three options here:
    1. Use Process Documents to get TF/IDF, Cross Distances to get the cosine similarity, filter and join the original data.
    2. Cross-Join the data, use Generate Levenshtein Distance
    3. Use the DataBase Envy extension to do a non equal join. (I am not sure if this supports complex joins on fuzzy stuff). @BalazsBarany can you give some feedback on this?

    Since you are also a customer, i would propose we do a quick call to walk you through? What times work better, European or East Coast work hours?

    Best,
    Martin
  • pallav
    pallav New Altair Community Member
    @mschmitz -I am good with European timing. We can arrange call now.
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi @mschmitz,

    the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1

    In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.

    Regards,
    Balázs