Sample One row within a group

yrgowtham
yrgowtham New Altair Community Member
edited November 5 in Community Q&A

Hi Experts,
I have a table with PatientID, the day of their stay and max vital signs for the day.
I want to create a process that randomly samples one day for each patient.
Table structure:
PatientID    Day Number    Max_Temp    Max_Resp    Max_SBP    Max_HR
ABC          1             98.7        32          90         72
ABC          2             98.8        33          95         75
ABC          3             95          35          90         78
DEF          1             98.7        32          90         72
DEF          2             95          35          90         78
The output of my process should contain one randomly picked day for each patient and should look like this:

PatientID    Day Number    Max_Temp    Max_Resp    Max_SBP    Max_HR
ABC          2             98.8        33          95         75
DEF          1             98.7        32          90         72

 

Methods I have tried:

  1. I tried the Sample operator with the balance data option, but it requires me to list each PatientID in
    the parameter list (sample size per class). This is not feasible because there are more than 50,000 patient IDs.
  2. Using R code (Execute R) would solve this, but I am trying to find out whether there is a way to do it in RapidMiner itself.

    I am looking for a more automated method to achieve this in RapidMiner.

    Please let me know if you need more info.
    Thanks in advance :)


Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    You can sort your dataset by a random variable (which you can add with "Generate Attributes" if you need to) and then simply use "Remove Duplicates" to get rid of records based on the patient ID. This should give you one random day per patient in the resulting dataset.

  • kypexin
    kypexin New Altair Community Member

    @Telcontar120 - pretty elegant solution! However, why would you want to sort the dataset by a random variable beforehand?

  • Telcontar120
    Telcontar120 New Altair Community Member

    @kypexin Sorting by a random variable should help ensure it doesn't systematically keep the same day for each patient. (I'm not 100% sure what the internal logic is for removing duplicates, but it might conceivably be related to the order in which the records appear, so if your dataset is sorted by patient/day, that could lead to non-random sampling results.)
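
For readers who want to see the logic of the suggestion above outside RapidMiner, here is a minimal Python sketch of the same idea: attach a random key to every row, sort by it, then keep only the first row seen per patient. It is an illustration, not RapidMiner's internal logic. The function name `sample_one_row_per_group`, the `seed` parameter, and the shortened column name `Day` are assumptions made for this example, and it assumes that duplicate removal keeps the first row of each group, which is the open question raised in the thread.

```python
import random

def sample_one_row_per_group(rows, key, seed=None):
    """Keep one randomly chosen row per distinct value of `key`.

    Hypothetical helper for illustration: random key, sort, keep first per group.
    """
    rng = random.Random(seed)
    # Step 1: attach a random sort key to every row (the "random variable").
    keyed = [(rng.random(), row) for row in rows]
    # Step 2: sort by the random key so the row order no longer follows day order.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: keep only the first row encountered for each group value
    # (assumes "keep first" behaviour for duplicate removal).
    seen = set()
    sampled = []
    for _, row in keyed:
        if row[key] not in seen:
            seen.add(row[key])
            sampled.append(row)
    return sampled

# Example data copied from the table in the question.
rows = [
    {"PatientID": "ABC", "Day": 1, "Max_Temp": 98.7, "Max_Resp": 32, "Max_SBP": 90, "Max_HR": 72},
    {"PatientID": "ABC", "Day": 2, "Max_Temp": 98.8, "Max_Resp": 33, "Max_SBP": 95, "Max_HR": 75},
    {"PatientID": "ABC", "Day": 3, "Max_Temp": 95.0, "Max_Resp": 35, "Max_SBP": 90, "Max_HR": 78},
    {"PatientID": "DEF", "Day": 1, "Max_Temp": 98.7, "Max_Resp": 32, "Max_SBP": 90, "Max_HR": 72},
    {"PatientID": "DEF", "Day": 2, "Max_Temp": 95.0, "Max_Resp": 35, "Max_SBP": 90, "Max_HR": 78},
]

sample = sample_one_row_per_group(rows, "PatientID", seed=42)
for row in sorted(sample, key=lambda r: r["PatientID"]):
    print(row)
```

Because the row with the smallest random key in a group is equally likely to be any of that group's rows, each day has the same chance of being kept. Skipping the sort step would keep whichever row happens to come first, for example always day 1 if the data is ordered by patient and day.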