"sampling using external dataset weights"

User: "cesko80"
New Altair Community Member
Updated by Jocelyn
I have two datasets with the same attributes. I need to sample dataset 2 so that it reproduces the value distribution of dataset 1 on a given attribute.

Example dataset1:
Id;attr1;attr2
1;a;555
2;a;550
3;b;400

Example dataset2:
Id;attr1;attr2
1;a;555
2;a;550
3;a;551
4;a;590
5;b;420

Example of sampled dataset2 based on attr1 in dataset1:
Id;attr1;attr2
2;a;550
4;a;590
5;b;420

    User: "Andrew2"
    New Altair Community Member
    Hello

    I'm not quite sure what is needed. Is it as follows?

    In the first example set, count how many times a appears and how many times b appears. Express each count as a fraction of the whole, so it would be 0.67 for a and 0.33 for b. Then randomly select from the second example set so that a and b appear in the same fractions as in the first. Is that correct?

    Andrew
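
As a rough illustration of the counting step described above, here is a minimal sketch in plain Python rather than RapidMiner operators. The function name `value_fractions` and the tuple layout of the rows are assumptions made for this example; the data is dataset1 from the question.

```python
from collections import Counter

# Example dataset1 from the question, as (Id, attr1, attr2) rows.
dataset1 = [(1, "a", 555), (2, "a", 550), (3, "b", 400)]


def value_fractions(rows, column=1):
    """Return the share of each distinct value found in the given column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


fractions = value_fractions(dataset1)
print({k: round(v, 2) for k, v in fractions.items()})  # {'a': 0.67, 'b': 0.33}
```
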
    User: "cesko80"
    New Altair Community Member
    OP
    Yes, that's correct. By the way, the process would be even more complete if we could generate new attributes for data set 2 in the case where data set 2 has fewer "a" than data set 1, maybe using averages as a replacement for the missing values. Thanks.
    User: "Andrew2"
    New Altair Community Member
    Hello

    It's fiddly, but it will be possible. I have something simple worked out, but if I give you some pointers you may find that you get there first, since I don't have the spare time at the moment.

    Use Aggregate to count the numbers of examples of type a and type b.
    Use Extract Macros and Generate Macros to determine the ratio of a to b.
    In the set to be sampled, assume one type is the most frequent. Filter and allow this type through completely.
    The number of the other value is determined by the number of the first value adjusted by the ratio from earlier.
    Use the Sample operator to select this number of the other value.
    Use Append to join the example sets together.

    This is not a generic method and will struggle if some of the assumptions are wrong.

    Andrew
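
The pointers above can also be sketched outside RapidMiner. The following is a minimal plain-Python illustration, not the RapidMiner process itself: the function name `sample_to_match` and the tuple layout of the rows are assumptions made for this example. Instead of assuming which value is let through completely, the sketch works out the limiting value from the counts, which is a small generalisation of the steps described.

```python
import random
from collections import Counter
from fractions import Fraction

# Example data from the question, as (Id, attr1, attr2) rows.
dataset1 = [(1, "a", 555), (2, "a", 550), (3, "b", 400)]
dataset2 = [(1, "a", 555), (2, "a", 550), (3, "a", 551),
            (4, "a", 590), (5, "b", 420)]


def sample_to_match(reference, target, column=1, seed=None):
    """Sample rows of `target` so `column` keeps the value ratios of `reference`."""
    rng = random.Random(seed)
    ref_counts = Counter(row[column] for row in reference)

    # Group the rows to be sampled by their value in the chosen column.
    groups = {}
    for row in target:
        groups.setdefault(row[column], []).append(row)

    # Largest factor by which the reference counts can be scaled without
    # asking for more rows than `target` holds. A value that is missing
    # from `target` gives a factor of 0, so nothing is sampled.
    scale = min(Fraction(len(groups.get(value, [])), count)
                for value, count in ref_counts.items())

    sampled = []
    for value, count in ref_counts.items():
        k = int(count * scale)  # rounds down to a whole number of rows
        sampled.extend(rng.sample(groups.get(value, []), k))
    return sorted(sampled)


sampled = sample_to_match(dataset1, dataset2, seed=42)
for row in sampled:
    print(";".join(str(field) for field in row))
```

With the example data this returns two randomly chosen "a" rows plus the single "b" row, which matches the shape of the sampled dataset2 in the question. It does not cover the follow-up request to generate replacement rows when dataset 2 has too few of a value.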