Read PDF Tables Extension - Need to

miked
miked New Altair Community Member
edited November 5 in Community Q&A
Hello - I am trying to use the "Read PDF Tables" extension. I have successfully read my PDF, but the result is split into 21 different example sets. I would like to use the "Select" operator to choose the example sets that I need, but I am running into two issues. First, "Select" only lets you pick one example set, whereas I need to select 5. Second, the example sets are not all the same: only 5 of the 21 have the attribute headings that I actually need. Would anyone have any ideas on how I can pull what I need from this collection? I have been trying to use loops, but without success. Thanks!

Answers

  • varunm1
    varunm1 New Altair Community Member
    edited February 2020
    Hello @miked

    Did you try using the "Multiply" operator after the collection and then connecting five "Select" operators, each picking one example set by its index in the collection? If all 5 have the same attribute names, you can also use the "Append" operator to combine them into a single example set.

    There may be some other solutions as well. @David_A or @mschmitz any ideas here?
  • miked
    miked New Altair Community Member
    @varunm1
    Thanks for the suggestion. That would definitely work for now. I think what I'm looking for is a bit more automation. My fear is that it won't always be the same 5 example sets. I was hoping for some way to identify which of those example sets has the attributes that I am looking for and pull those sets regardless of how many there are. 
    -Mike
  • varunm1
    varunm1 New Altair Community Member
    Hi Mike,

    Yep, understood. Let's see if anyone responds.
  • sgenzer
    sgenzer
    Altair Employee
    hi @miked I have worked with this situation before. I usually use "Loop Collection" afterwards and then check each ExampleSet to see if it has the attributes I'm looking for. Something like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="pdf_table_extraction:pdfs2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Read PDF Tables" width="90" x="112" y="136">
            <parameter key="resource_type" value="file"/>
            <parameter key="attribute" value=""/>
            <parameter key="tune extraction criteria" value="false"/>
            <parameter key="discard tables with no rows" value="false"/>
            <parameter key="discard empty attributes" value="false"/>
            <parameter key="heuristic ratio for table content" value="0.65"/>
            <parameter key="tune edge detection criteria" value="false"/>
            <parameter key="grayscale intensity threshold" value="25"/>
            <parameter key="minimum width of horizontal edge" value="50"/>
            <parameter key="minimum height of vertical edge" value="10"/>
            <parameter key="maximum cell corner distance" value="10"/>
            <parameter key="required text lines for edge" value="4"/>
            <parameter key="required cells for table" value="4"/>
            <parameter key="point snap distance threshold" value="8.0"/>
            <parameter key="table padding amount" value="1.0"/>
            <parameter key="identical table overlap ratio" value="0.9"/>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="246" y="136">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <description align="center" color="transparent" colored="false" width="126">enter the attribute of example sets you want to keep</description>
              </operator>
              <operator activated="true" class="branch" compatibility="9.6.000" expanded="true" height="82" name="Branch" width="90" x="179" y="34">
                <parameter key="condition_type" value="min_attributes"/>
                <parameter key="condition_value" value="1"/>
                <parameter key="expression" value=""/>
                <parameter key="io_object" value="ANOVAMatrix"/>
                <parameter key="return_inner_output" value="true"/>
                <process expanded="true">
                  <connect from_port="condition" to_port="input 1"/>
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="132" y="13">keep the ExampleSet</description>
                </process>
                <process expanded="true">
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="162" y="13">do not keep the ExampleSet</description>
                </process>
                <description align="center" color="transparent" colored="false" width="126">branch to some minimum # of attributes (1?)</description>
              </operator>
              <connect from_port="single" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Branch" to_port="condition"/>
              <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read PDF Tables" from_port="collection of pdf data tables as example sets" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
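
For readers who prefer to see the control flow as code, here is a minimal plain-Python sketch of what the process above does: loop over the collection, keep only the wanted attributes, and keep a table only if at least one attribute survives. This is not RapidMiner code. The function names and all sample data are made up for illustration, and it assumes that selecting attribute names that do not exist leaves a table with zero attributes.

```python
from typing import Dict, List

# A table is modelled as: column name -> list of cell values (illustrative only).
Table = Dict[str, List[str]]


def select_attributes(table: Table, wanted: List[str]) -> Table:
    """Keep only the wanted columns that actually exist in the table."""
    return {name: table[name] for name in wanted if name in table}


def loop_collection(tables: List[Table], wanted: List[str],
                    min_attributes: int = 1) -> List[Table]:
    """Mimic Loop Collection + Select Attributes + Branch (min_attributes)."""
    kept = []
    for table in tables:
        selected = select_attributes(table, wanted)
        # Branch: keep the table only if enough attributes survived the selection.
        if len(selected) >= min_attributes:
            kept.append(selected)
    return kept


# Made-up sample collection: the second table has none of the wanted columns.
tables = [
    {"Region": ["North", "South"], "Sales": ["10", "12"]},
    {"Page": ["1"], "Footer": ["Confidential"]},
    {"Region": ["East"], "Sales": ["7"], "Notes": ["n/a"]},
]

kept = loop_collection(tables, wanted=["Region", "Sales"])
print(len(kept))  # 2
```

Raising `min_attributes` to the number of wanted columns would keep only tables that contain all of them, not just one.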
    



  • miked
    miked New Altair Community Member
    Answer ✓
    Hi @sgenzer... Great, thank you. That definitely helps narrow down which example sets have the attributes that I need. Would I then just follow @varunm1's method and connect n "Select" operators to Append the sets together? Is there a way of using a macro to count the example sets and just have "Select" loop n times? If not, this should work for now, and I thank you both for your help.
    -Mike
  • sgenzer
    sgenzer
    Altair Employee
    Answer ✓
    hi @miked if all the ExampleSets are the same (or similar), I'd just drop an Append (Superset) on the end. Like this:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="pdf_table_extraction:pdfs2exampleset_operator" compatibility="0.2.001" expanded="true" height="68" name="Read PDF Tables" width="90" x="45" y="34">
            <parameter key="resource_type" value="file"/>
            <parameter key="attribute" value=""/>
            <parameter key="tune extraction criteria" value="false"/>
            <parameter key="discard tables with no rows" value="false"/>
            <parameter key="discard empty attributes" value="false"/>
            <parameter key="heuristic ratio for table content" value="0.65"/>
            <parameter key="tune edge detection criteria" value="false"/>
            <parameter key="grayscale intensity threshold" value="25"/>
            <parameter key="minimum width of horizontal edge" value="50"/>
            <parameter key="minimum height of vertical edge" value="10"/>
            <parameter key="maximum cell corner distance" value="10"/>
            <parameter key="required text lines for edge" value="4"/>
            <parameter key="required cells for table" value="4"/>
            <parameter key="point snap distance threshold" value="8.0"/>
            <parameter key="table padding amount" value="1.0"/>
            <parameter key="identical table overlap ratio" value="0.9"/>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.6.000" expanded="true" height="82" name="Loop Collection" width="90" x="179" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <description align="center" color="transparent" colored="false" width="126">enter the attribute of example sets you want to keep</description>
              </operator>
              <operator activated="true" class="branch" compatibility="9.6.000" expanded="true" height="82" name="Branch" width="90" x="179" y="34">
                <parameter key="condition_type" value="min_attributes"/>
                <parameter key="condition_value" value="1"/>
                <parameter key="expression" value=""/>
                <parameter key="io_object" value="ANOVAMatrix"/>
                <parameter key="return_inner_output" value="true"/>
                <process expanded="true">
                  <connect from_port="condition" to_port="input 1"/>
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="132" y="13">keep the ExampleSet</description>
                </process>
                <process expanded="true">
                  <portSpacing port="source_condition" spacing="0"/>
                  <portSpacing port="source_input 1" spacing="0"/>
                  <portSpacing port="sink_input 1" spacing="0"/>
                  <portSpacing port="sink_input 2" spacing="0"/>
                  <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="162" y="13">do not keep the ExampleSet</description>
                </process>
                <description align="center" color="transparent" colored="false" width="126">branch to some minimum # of attributes (1?)</description>
              </operator>
              <connect from_port="single" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Branch" to_port="condition"/>
              <connect from_op="Branch" from_port="input 1" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="operator_toolbox:advanced_append" compatibility="2.3.000" expanded="true" height="82" name="Append (Superset)" width="90" x="313" y="34"/>
          <connect from_op="Read PDF Tables" from_port="collection of pdf data tables as example sets" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_op="Append (Superset)" to_port="example set 1"/>
          <connect from_op="Append (Superset)" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
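
As a rough illustration of the "superset" idea, the plain-Python sketch below stacks tables whose columns only partly overlap. It is not the operator's implementation; it assumes that a column absent from one table is filled with missing values (`None` here), and the function name and sample data are invented for the example.

```python
from typing import Dict, List, Optional

# A table is modelled as: column name -> list of cell values (illustrative only).
Table = Dict[str, List[Optional[str]]]


def append_superset(tables: List[Table]) -> Table:
    """Stack tables row-wise over the union of their columns."""
    # Union of column names, in first-seen order.
    columns: List[str] = []
    for table in tables:
        for name in table:
            if name not in columns:
                columns.append(name)

    merged: Table = {name: [] for name in columns}
    for table in tables:
        n_rows = max((len(values) for values in table.values()), default=0)
        for name in columns:
            # Columns missing from this table are padded with None.
            merged[name].extend(table.get(name, [None] * n_rows))
    return merged


# Made-up sample data with partly overlapping headers.
a = {"Region": ["North", "South"], "Sales": ["10", "12"]}
b = {"Region": ["East"], "Notes": ["n/a"]}

merged = append_superset([a, b])
print(merged["Sales"])  # ['10', '12', None]
```

If the headers differ only by spelling or whitespace, the union will contain both variants as separate columns, so it is worth checking the merged header list.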
    



  • ey1
    ey1 New Altair Community Member
    edited February 2020 Answer ✓
    If you are still looking for a way to automate filtering the collection, consider the other condition types of the Branch operator in the process proposed by @sgenzer, such as the minimum or maximum number of attributes or examples. If you want to filter by attribute names, first inspect the output ExampleSet(s) to check whether the Read PDF Tables operator gives you the attribute names you want (it's not guaranteed, since it depends on the detection and extraction method); if it does once, it will always do so. In that case, you can put the attribute names in macros and use a complex expression in the Branch operator to keep only the ExampleSets with the desired attribute name(s). If they have exactly the same header structure, you can Append them as @varunm1 suggested.
    I am attaching a test process for reference. It logs an error message as a hint if the condition is not fulfilled.
    Cheers,
    Edwin
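
The attached process is not reproduced in this thread. As a stand-in, here is a hedged plain-Python sketch of the same idea: keep only tables that contain every required attribute name, and log a hint for each table that does not meet the condition. The function name, logger name, and sample data are assumptions made for illustration, not part of the original process.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("collection_filter")  # hypothetical logger name

# A table is modelled as: column name -> list of cell values (illustrative only).
Table = Dict[str, List[str]]


def filter_by_names(tables: List[Table], required: List[str]) -> List[Table]:
    """Keep tables that contain every required attribute name."""
    kept = []
    for index, table in enumerate(tables, start=1):
        missing = [name for name in required if name not in table]
        if missing:
            # Hint for the user: which table failed the condition, and why.
            logger.warning("Table %d skipped, missing attributes: %s",
                           index, missing)
            continue
        kept.append(table)
    return kept


# Made-up sample collection; only the first table has both required headers.
tables = [
    {"Region": ["North"], "Sales": ["10"]},
    {"Region": ["East"], "Notes": ["n/a"]},
    {"Page": ["1"]},
]

kept = filter_by_names(tables, required=["Region", "Sales"])
print(len(kept))  # 1
```

Matching here is exact, so a header that the extraction renders slightly differently (extra space, different apostrophe) would fail the check and show up in the log.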
  • miked
    miked New Altair Community Member
    @sgenzer
    That's fantastic, thank you all!
    Two supplemental questions, not vital to solving the issue:
    1 - I had 3 attributes that did not come through in the Loop -> Select Attributes step, so I decided to just go with "all". Two of the column headers are labeled in the PDF as "CurrentMonth's Sale" and "CYTD 2019", so I'm assuming there are some limits to what Read PDF Tables can do, as @ey1 stated above?
    2 - If the example sets were not all the same, can I manipulate them inside the collection, or is it better to use "Branch" and pull them out?
    I'm a bit of a newbie, especially with "Collections." I really appreciate the help of the group here.
    -Mike
  • sgenzer
    sgenzer
    Altair Employee
    hi @miked glad everything is working for you! It's hard to answer your new questions here without really seeing some examples. There are some limitations to the Read PDF Tables operator - mostly because PDF tables come in a ton of different shapes and sizes.