Balancing Data based on class
b00122599
New Altair Community Member
Hey folks,
I'm getting a bit lost here playing with the Sampling operators and not getting anywhere. I have a record set of 150k entries with three classes; two of the classes are very small, with fewer than 10k entries each. I would like to output a result with an equal number of examples from all three classes, so if I have 15k in total then I'll have 5k Class A, 5k Class B and 5k Class C. I will lose a lot of the largest class, but I want to compare all three classes in this way. Would anyone have any pointers? Thanks in advance.
Neil.
Best Answer
Hi Neil,
You can use the Sample operator for this with the "balance data" option activated. If you do this, you can specify the desired number of examples for each of your classes. Below is a small example process demonstrating this.
Hope this helps,
Ingo
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="sample" compatibility="9.2.001" expanded="true" height="82" name="Sample" width="90" x="179" y="34">
        <parameter key="sample" value="absolute"/>
        <parameter key="balance_data" value="true"/>
        <parameter key="sample_size" value="100"/>
        <parameter key="sample_ratio" value="0.1"/>
        <parameter key="sample_probability" value="0.1"/>
        <list key="sample_size_per_class">
          <parameter key="Yes" value="200"/>
          <parameter key="No" value="200"/>
        </list>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
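For the three-class case in the question, the same Sample operator could probably be adapted as sketched below. This is only an illustration built from the parameters in the process above: the keys "Class A", "Class B" and "Class C" are placeholders taken from the question and would need to match the actual values of your label attribute, and 5000 is just the per-class size mentioned there. Since this is downsampling, it is safest to assume the per-class size should be no larger than the number of examples in your smallest class.

```xml
<!-- Sketch only: class names are placeholders and must match the real label values -->
<operator activated="true" class="sample" compatibility="9.2.001" expanded="true" name="Sample">
  <!-- absolute = request a fixed number of examples rather than a ratio -->
  <parameter key="sample" value="absolute"/>
  <!-- balance_data = use the per-class sizes listed below -->
  <parameter key="balance_data" value="true"/>
  <list key="sample_size_per_class">
    <parameter key="Class A" value="5000"/>
    <parameter key="Class B" value="5000"/>
    <parameter key="Class C" value="5000"/>
  </list>
  <!-- optional: a local seed makes the sample repeatable between runs -->
  <parameter key="use_local_random_seed" value="true"/>
  <parameter key="local_random_seed" value="1992"/>
</operator>
```

With these settings the result would contain 15k examples in total, 5k per class, which matches the split described in the question.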
Answers
Thanks very much, that did the trick!
Neil.