"Compare two customer databases"

xiaobo_sxb
xiaobo_sxb New Altair Community Member
edited November 5 in Community Q&A
Hi

I have two customer tables which contains their information like name, address, phone etc. Most of them are actually the same customer set. I'd like to map them if they are the same, by comparing above fields. Both table has more than 10K records. Does anybody know how to do that in Rapidminer?

Best Regards
Steven

Answers

  • Hello Steven

    The Join operator lets you join tables together. You could also use a distance to similarity approach to see what records are close to one another.

    Regards,

    Andrew
  • xiaobo_sxb
    xiaobo_sxb New Altair Community Member
    Hi Andrew

    Thank you for your reply. I still have questions for your proposal.

    First, the join operator require the two dataset have the same ID (the key). For my case, I don't have the same ID.

    For the "data to similarity" operator, still not good enough. First, it will create a cross join across all the rows, in my case I have more than 10K rows for both of the tables, and I doubt the performance. Second, even I have the similarity score, I don't know the threshold for determining the possibility of two rows as the same customer. Is it possible to generate the possibility to say, how much percent of confidence we can say the two customers are actually the same one?

    Regards
    Steven
  • Hello Steven

    Well if there is no common ID then there is obviously no way to use Join.

    Actually a better operator would be Cross Distances which allows the selection of the top k nearest. The threshold completely depends on the data you have and I can't answer that.

    Performance may not be that bad; you have to try it.

    regards

    Andrew
  • xiaobo_sxb
    xiaobo_sxb New Altair Community Member
    Does anyone know how the cross distance operator work? Seems it works well for numbers but not for text. I use the operator for double bytes text and the result is totally incorrect. I created several records manually, some of them are quite close with only few words difference, some of them are quite different, but the distance it generated does not provide the result as it should be.

    Regards
    Steven
  • One way is to extract the words from each description to create a word vector where the value of each attribute is based on a measure like tf.idf. The text processing extension is your friend for this.

    Andrew
  • MariusHelf
    MariusHelf New Altair Community Member
    Also the Join operator is still an option, but it will only match customers that have an exact match in name, phone number etc.: in RapidMiner you can join by comparing the desired fields, even if there is no common id.
    An outer join is the join type you'll want to use in your case to spot customers that are part of only one of the tables.

    Best regards,
    Marius