Filtering collection with criteria
pblack476
New Altair Community Member
I am on a quest to retrieve useful data from PDF.
I already conquered the first battle with the Table Extraction extension. I am now faced with another challenge:
How do I filter a collection? Let's say I want to ignore example sets with fewer than 10 examples in a collection and output a collection of all the other example sets. How can I go about doing that?
Best Answer
-
Hi @pblack476, you can use the Loop Collection operator to evaluate each example set individually.
Inside the loop, you can use a Branch operator to discard the example sets that don't meet the requirement.
The example process below shows this.
Best,
David
<process version="9.6.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="concurrency:loop" compatibility="9.6.000-BETA" expanded="true" height="82" name="Loop" width="90" x="179" y="34">
        <parameter key="number_of_iterations" value="20"/>
        <parameter key="iteration_macro" value="iteration"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="generate_macro" compatibility="9.6.000-BETA" expanded="true" height="68" name="Generate Macro" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="random" value="round(rand()*100)"/>
            </list>
            <description align="center" color="transparent" colored="false" width="126">Generate a random number between 1 and 100</description>
          </operator>
          <operator activated="true" class="generate_data" compatibility="9.6.000-BETA" expanded="true" height="68" name="Generate Data" width="90" x="380" y="34">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="%{random}"/>
            <parameter key="number_of_attributes" value="5"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop_collection" compatibility="9.6.000-BETA" expanded="true" height="82" name="Loop Collection" width="90" x="715" y="34">
        <parameter key="set_iteration_macro" value="false"/>
        <parameter key="macro_name" value="iteration"/>
        <parameter key="macro_start_value" value="1"/>
        <parameter key="unfold" value="false"/>
        <process expanded="true">
          <operator activated="true" class="branch" compatibility="9.6.000-BETA" expanded="true" height="82" name="Branch" width="90" x="447" y="34">
            <parameter key="condition_type" value="min_examples"/>
            <parameter key="condition_value" value="50"/>
            <parameter key="expression" value=""/>
            <parameter key="io_object" value="ANOVAMatrix"/>
            <parameter key="return_inner_output" value="true"/>
            <process expanded="true">
              <connect from_port="condition" to_port="input 1"/>
              <portSpacing port="source_condition" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_input 1" spacing="0"/>
              <portSpacing port="sink_input 2" spacing="0"/>
            </process>
            <process expanded="true">
              <portSpacing port="source_condition" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_input 1" spacing="0"/>
              <portSpacing port="sink_input 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="single" to_op="Branch" to_port="condition"/>
          <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="yellow" colored="false" height="245" resized="false" width="180" x="408" y="130">Here the branch condition is minimum number of examples.<br/><br/>If it's over 50, the example set is passed through, if not it's discarded.<br/><br/>The same logic could be applied on number of attributes, or number of missings.</description>
        </process>
      </operator>
      <connect from_op="Loop" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="142" y="142">Generate a collection with 20 example sets,<br/>with a random number of examples (between 1 and 100)</description>
      <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="664" y="159">Loop through all example sets in the collection and evaluate the number of examples</description>
    </process>
  </operator>
</process>
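For the threshold in the original question (dropping example sets with fewer than 10 examples), the same Branch operator should work with only `condition_value` changed. A minimal sketch, assuming `min_examples` treats the value as an inclusive lower bound; it is worth checking that with an example set of exactly 10 examples:

```xml
<!-- Branch inside Loop Collection, copied from the process above.
     Only condition_value differs (10 instead of 50).
     Assumption: min_examples is true when the set has at least 10 examples. -->
<operator activated="true" class="branch" compatibility="9.6.000-BETA" expanded="true" name="Branch">
  <parameter key="condition_type" value="min_examples"/>
  <parameter key="condition_value" value="10"/>
  <parameter key="return_inner_output" value="true"/>
  <process expanded="true">
    <!-- Then: pass the example set through to the loop output -->
    <connect from_port="condition" to_port="input 1"/>
  </process>
  <process expanded="true">
    <!-- Else: left unconnected, so the example set is dropped -->
  </process>
</operator>
```

The empty Else subprocess is what does the filtering: nothing is delivered for that iteration, so the resulting collection contains only the example sets that passed the condition.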
Answers
-
Hi @pblack476,
try the Loop Collection operator.
In each loop execution you'll get one example set. You can then use, for example, Extract Macro to determine the number of examples, and return the example set only if it meets your condition.
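A rough sketch of that variant, written as the inner process of Loop Collection. The macro name `example_count` is made up for illustration. The Extract Macro parameter keys (`macro`, `macro_type`), its `example set` port name, and the expression syntax are written from memory of RapidMiner 9.x, so treat them as assumptions and check them against the operator's parameter panel:

```xml
<!-- Inner process of Loop Collection; sketch only, keys and ports are assumptions -->
<process expanded="true">
  <operator activated="true" class="extract_macro" compatibility="9.6.000-BETA" expanded="true" name="Extract Macro">
    <!-- example_count is a hypothetical macro name -->
    <parameter key="macro" value="example_count"/>
    <parameter key="macro_type" value="number_of_examples"/>
  </operator>
  <operator activated="true" class="branch" compatibility="9.6.000-BETA" expanded="true" name="Branch">
    <parameter key="condition_type" value="expression"/>
    <!-- Assumed to expand to something like "37 >= 10" before evaluation -->
    <parameter key="expression" value="%{example_count} &gt;= 10"/>
    <process expanded="true">
      <!-- Then: pass the example set through -->
      <connect from_port="input 1" to_port="input 1"/>
    </process>
    <process expanded="true">
      <!-- Else: nothing connected, so nothing is returned -->
    </process>
  </operator>
  <connect from_port="single" to_op="Extract Macro" to_port="example set"/>
  <connect from_op="Extract Macro" from_port="example set" to_op="Branch" to_port="input 1"/>
  <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
</process>
```

Compared with a built-in condition type such as a minimum example count, the macro route takes one more operator but lets the condition be any expression over the extracted value.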
Regards,
Balázs
-
Oh wow. I did not know the Branch operator could have empty connections inside. That solves it. I was trying to do just that, but filtering out the "else" example sets was getting really complex.
Thanks to both of you!