I have two excel i want to find if row number 1 from one excel is present in my second exel or not.

pallav
pallav New Altair Community Member
edited November 2024 in Community Q&A
I have two excel i want to find if row number 1 from first excel is present in my second excel or not. If not completely then by what percent it is matching . some thing like fuzzy  matching in python .

Its is text "outlet name can be name of any outlet , address can be any address city state and zip .  how we can see if that row is present in other excel or not . if not completely then by what percent it is matching





Tagged:

Welcome!

It looks like you're new here. Sign in or register to get started.

Best Answer

  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi @mschmitz,

    the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1

    In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.

    Regards,
    Balázs

Answers

  • sgenzer
    sgenzer
    Altair Employee
    hi @pallav if you want a clear "is the row in the other sheet or not", I would simply do an inner join on the two tables on all the attributes and see if there are any examples coming out of the join. If you want a more fuzzy matching solution, I would concatenate all the text attributes into one attribute (using a space delimiter probably) and train a model to predict match/no match after text mining word vectorization.
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi,
    the other way would be to append the 2 data sets and use Remove Duplicates afterwards. Rows which are in both data sets are returned at the dup port.
    Best,
    Martin
  • YYH
    YYH
    Altair Employee
    edited February 2020
    Hi @pallav,

    How about the similarity analytics? You can try the "cross distances" on separate data sets or "data to similarity" if you append two tables together. For strings like address/city/state/outlet name or zipcode, you may want to normalize all upper case letters to lower case and then apply "nominal to numerical" before calculating the cross-distance between the examples. Of course, we can do the nominal measurements. But you have more various formulas with numerical measurement to quantitatively define the difference.

    operator doc
    https://docs.rapidminer.com/latest/studio/operators/modeling/similarities/cross_distances.html
    previous discussions/knowledge base 
    https://community.rapidminer.com/discussion/53950/cross-distance-how-is-it-calculated
    https://community.rapidminer.com/discussion/53715/two-documents-similarity-using-cross-distance

    HTH!
    YY
  • pallav
    pallav New Altair Community Member
    @yyhuang @mschmitz @sgenzer - The problem is the text might not match 100% in that case if i will do cross join remove  duplicate and other technique it may not match .. In that case we have to do somethiing like how we do in python . Either tokenizing  the word and later taking tfid vector of each word find the cosine similarity with other excel after doing same preprocessing , This we can do in python but how can we do same in rapidminer .
  • MartinLiebig
    MartinLiebig
    Altair Employee
    Hi @pallav,

    i think we got three options here:
    1. Use Process Documents to get TF/IDF, Cross Distances to get the cosine similarity, filter and join the original data.
    2. Cross-Join the data, use Generate Levenshtein Distance
    3. Use the DataBase Envy extension to do a non equal join. (I am not sure if this supports complex joins on fuzzy stuff). @BalazsBarany can you give some feedback on this?

    Since you are also a customer, i would propose we do a quick call to walk you through? What times work better, European or East Coast work hours?

    Best,
    Martin
  • pallav
    pallav New Altair Community Member
    @mschmitz -I am good with European timing. We can arrange call now.
  • BalazsBaranyRM
    BalazsBaranyRM New Altair Community Member
    Answer ✓
    Hi @mschmitz,

    the join operator in Database Envy supports inner joins on criteria that can be expressed between two values (column A from example set 1 and column B from example set 2) with an arbitrarily complex expression. If you can create one measure per example set, you can match them with a fuzzy expression, e. g. Math.abs(a - b) < 1

    In this case I would go with a similarity matrix, with character n-grams if there is a reason to assume that the words are not written in the same way.

    Regards,
    Balázs

Welcome!

It looks like you're new here. Sign in or register to get started.

Welcome!

It looks like you're new here. Sign in or register to get started.