"[SOLVED] sampling a number of examples from different groups"
jan87
New Altair Community Member
Dear community,
is it possible to draw a sample of, let's say, 50 examples from each of several groups that are defined by combinations of different attributes?
For example, I have attribute a with values 1, 2 and 3 and attribute b with values 1, 2 and 3. The groups formed by the different combinations of these values contain different numbers of examples. How can I get a sample with the same number of examples from every group?
I already tried the Multiply operator followed by a separate filter operator for each group, but I have so many groups that this would take days to build...
Thanks for your help
Answers
Hi,
your basic idea is good and you can follow it: filter the example set by group with the help of Filter Examples, apply the sampling, and then append the data from all groups.
A chain of Loop Values operators saves you from creating the filter for each group manually. The process is still not trivial, but once it is set up, you can even add new groups to your data without having to update the process.
Best, Marius
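For readers who want to see the filter, sample, append logic outside of RapidMiner, here is a minimal sketch in plain Python. It is only an illustration of the idea, not the RapidMiner process itself: the function name `sample_per_group`, the toy data and the `id` field are made up for this example.

```python
import random
from collections import defaultdict


def sample_per_group(rows, group_keys, n, seed=None):
    """Draw exactly n rows from every combination of the group_keys values."""
    rng = random.Random(seed)

    # "Filter" step: bucket the rows by their combination of group values.
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)

    # "Sample" and "Append" steps: sample each bucket, then concatenate.
    result = []
    for key in sorted(groups):
        # rng.sample raises ValueError if a group has fewer than n rows.
        result.extend(rng.sample(groups[key], n))
    return result


# Hypothetical toy data: attributes a and b each take the values 1, 2 and 3,
# and the nine resulting groups all have different sizes.
rows = [
    {"a": a, "b": b, "id": i}
    for a in (1, 2, 3)
    for b in (1, 2, 3)
    for i in range(10 * a + b)
]

balanced = sample_per_group(rows, ["a", "b"], n=5, seed=42)
```

Because the groups are discovered from the data, a new value of a or b is picked up automatically, which is the same benefit the Loop Values chain gives over hand-built filters.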
Hi,
could you perhaps give me a small example of how to use the Loop Values operator for this problem? I do not understand how to use it...
Thanks
Here you go! Please note the use of the iteration macros in the Filter Examples operators.
The Aggregate operator at the end is only there to show that you get 3 examples for each combination of att1 and att2.
You will run into problems if a group contains fewer than (in this case) 3 examples. You could use the Branch operator to check that a group has at least group_size examples and only apply the sampling in that case.
You'll find the code below.
All the best,
Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
<process expanded="true" height="352" width="718">
<operator activated="true" class="generate_nominal_data" compatibility="5.3.000" expanded="true" height="60" name="Generate Nominal Data" width="90" x="112" y="30">
<parameter key="number_examples" value="1000"/>
</operator>
<operator activated="true" class="loop_values" compatibility="5.3.000" expanded="true" height="76" name="Loop Values" width="90" x="313" y="30">
<parameter key="attribute" value="att1"/>
<parameter key="iteration_macro" value="v1"/>
<process expanded="true" height="370" width="736">
<operator activated="true" class="loop_values" compatibility="5.3.000" expanded="true" height="76" name="Loop Values (2)" width="90" x="246" y="30">
<parameter key="attribute" value="att2"/>
<parameter key="iteration_macro" value="v2"/>
<process expanded="true" height="370" width="736">
<operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="att1=%{v1}"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples (2)" width="90" x="313" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="att2=%{v2}"/>
</operator>
<operator activated="true" class="sample" compatibility="5.3.000" expanded="true" height="76" name="Sample" width="90" x="447" y="30">
<parameter key="sample_size" value="3"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
<connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<connect from_port="example set" to_op="Loop Values (2)" to_port="example set"/>
<connect from_op="Loop Values (2)" from_port="out 1" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="5.3.000" expanded="true" height="76" name="Append" width="90" x="447" y="30"/>
<operator activated="true" class="aggregate" compatibility="5.3.000" expanded="true" height="76" name="Aggregate" width="90" x="581" y="30">
<list key="aggregation_attributes">
<parameter key="label" value="count"/>
</list>
<parameter key="group_by_attributes" value="|att1|att2"/>
</operator>
<connect from_op="Generate Nominal Data" from_port="output" to_op="Loop Values" to_port="example set"/>
<connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
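The size check suggested above (sample only when a group has at least group_size examples) can be sketched in plain Python as follows. This is an illustration of the guard, not the Branch operator itself, and the helper name `sample_if_large_enough` is made up. The post does not say what should happen to groups that are too small, so passing them through unchanged is an assumption here; dropping them would be just as reasonable.

```python
import random


def sample_if_large_enough(group, group_size, rng):
    """Sample group_size rows, or keep the whole group if it is too small."""
    if len(group) >= group_size:
        return rng.sample(group, group_size)
    # Assumption: small groups are passed through unchanged; dropping them
    # instead would be an equally valid choice.
    return list(group)


rng = random.Random(0)
big_group = list(range(10))  # more than group_size examples
small_group = [1, 2]         # fewer than group_size examples

picked_big = sample_if_large_enough(big_group, 3, rng)
picked_small = sample_if_large_enough(small_group, 3, rng)
```

Note that with the pass-through choice the final sample is no longer perfectly balanced, since small groups contribute fewer than group_size examples.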
Hi Marius,
thank you very much for your very helpful example!
It's great you can solve this problem with RM, for which even SPSS seems not to have a solution...