Creating equally sized clusters that are representative of the population

Kristjan_Mar New Altair Community Member
edited November 5 in Community Q&A
Hi all,

I have a dataset (the population) of individuals who have signed up to be part of a group. When they signed up they gave some background information, which leaves me with five variables that I am mostly focusing on.

What I want to do is create 4 equally sized groups that are as representative of the whole population as possible. That is, I want to create 4 homogeneous groups.

Also, the dataset has some other columns that are important for handling and using it. I would like this information to be carried into each of the groups (subsamples) so that every row still matches the respondent it belongs to.

In short: How can I create four homogeneous subsamples that are representative of the population, using only selected variables from the dataset?

Cheers, K

Best Answers

  • Marco_Barradas
    Altair Employee
    Answer ✓
    Hi @Kristjan_Mar, it seems you need to create 4 stratified samples of your data.
    For that, use the Split Data operator with the sampling type set to stratified.

    Hope that helps you.
  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓
    I am confused by the wording of your intended outcome: "as representative of the whole population as possible" and "homogeneous" are typically not synonymous. If you want the groups to be as representative of the whole as possible, you basically want random subsets, which you can easily get with Split Data by choosing the shuffled sampling type. You would only need the stratified sampling type if you first set a nominal attribute as the label to stratify on and want to make sure that each resulting partition contains the same proportions of the label classes. I suggest you have a look at the tutorial and help text of the Split Data operator. (If you only want to look at the 5 attributes you are interested in, you can use Select Attributes before the split to bring in only those.)
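
For readers who want to see the difference between the two sampling types outside the Split Data operator, here is a minimal Python sketch. It is not how the operator is implemented. The function names `shuffled_split` and `stratified_split`, the example columns `id`, `region` and `age`, and the fixed seed are all assumptions made for illustration. Each row stays whole, so the extra columns travel with the respondent into whichever group the row lands in.

```python
import random
from collections import defaultdict


def shuffled_split(rows, k=4, seed=42):
    """Shuffle the rows, then deal them round-robin into k groups.

    Group sizes differ by at most one row.
    """
    rng = random.Random(seed)
    shuffled = list(rows)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def stratified_split(rows, label, k=4, seed=42):
    """Shuffle within each class of `label`, then deal round-robin.

    Every group receives nearly the same share of each class. The
    running offset keeps the overall group sizes balanced as well.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)

    groups = [[] for _ in range(k)]
    offset = 0
    for cls in sorted(by_class, key=str):
        members = by_class[cls]
        rng.shuffle(members)
        for i, row in enumerate(members):
            groups[(offset + i) % k].append(row)
        offset += len(members)
    return groups


# Hypothetical data: 100 respondents, one nominal attribute, one extra column.
rows = [
    {"id": i, "region": "north" if i % 3 == 0 else "south", "age": 20 + i % 40}
    for i in range(100)
]

random_groups = shuffled_split(rows, k=4)
strat_groups = stratified_split(rows, label="region", k=4)

for g in strat_groups:
    north = sum(r["region"] == "north" for r in g)
    print(len(g), "rows,", north, "from north")
```

With a plain shuffle the class proportions in each group are only approximately equal and vary with the seed. The stratified version fixes them for the one attribute chosen as the label, which is the trade-off described in the answer above.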

Answers

  • Kristjan_Mar
    Kristjan_Mar New Altair Community Member
    Thank you @MarcoBarradas and @Telcontar120!