Combining extraction methods for PDFs

Bob17_20299 Altair Community Member
edited October 2022 in Community Q&A

I'm using Data Prep Studio to parse the data in a PDF. The PDF contains multiple tables as well as some ad-hoc text in various places. The PDF table extractor works perfectly for the tables, but I can't grab anything other than tables with it. Is it possible to combine extraction techniques? For example, how would I grab "Ohio Capital Partners" from the top of the page in addition to all the tables? I'd also like to capture the date from the top (July 31, 2020) and the name of the fund (Ohio Capital Partners Onshore LP). A sample file is attached. I'm new to Monarch and still evaluating the free trial. Any help would be appreciated.

 

Answers

  • Baba_Majekodunmi_703
    Baba_Majekodunmi_703
    Altair Employee
    edited September 2022

    Hi Bob,

    Yes, you can extract multiple tables in Monarch. Please see the attached resulting workspace and classic files.

    In short, I built templates for extracting the data and then combined these templates. This makes the process repeatable, as opposed to the ad-hoc nature of using the table extractor.

    You will likely have more questions; don't hesitate to reach out if you do.

  • Bob17_21240
    Bob17_21240 New Altair Community Member
    edited September 2022

    Hi Baba,

    Thank you for your thoughtful and detailed response.  Your example is very helpful. I see how you appended tables - very useful!

    Each template is based on the same Ohio 7/31 PDF.  So how do I run this model on the next Ohio Capital document (8/31)? Do I have to "add" it to each template? That could get cumbersome.  Or, is there a way to add a PDF simultaneously to all the related templates?

    Thank you,

    Bob

     

  • Baba_Majekodunmi_703
    Baba_Majekodunmi_703
    Altair Employee
    edited September 2022

    Hi Bob,

    It’s actually very simple. Because we used templates instead of the table extractor, the process is repeatable. In other words, all you need to do is bring in the report for 8/31.

    There’s a feature in Data Prep Studio called “Edit All File Paths”. You can see it when you right-click on any of the tables in Data Prep Studio. When you click on it and browse to the location of the 8/31 file, Monarch will replace every reference to the 7/31 file with the 8/31 file.

    In short, bringing in the 8/31 file and having all these extraction methods applied should take seconds.

    Let me know if you’re able to do it successfully. I will try to add a screenshot and a video next week for reference.
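
    For readers who think in code, the effect of “Edit All File Paths” can be pictured as a bulk substitution over every table's source file. The sketch below is only a conceptual illustration in Python, not Monarch's API: the TableSource class, the edit_all_file_paths function, the table names, and the file names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableSource:
    """Hypothetical stand-in for one table and the file it reads from."""
    table_name: str
    source_path: str


def edit_all_file_paths(tables, old_path, new_path):
    """Return copies of the tables with old_path swapped for new_path.

    Tables that read from any other file are returned unchanged.
    """
    return [
        replace(t, source_path=new_path) if t.source_path == old_path else t
        for t in tables
    ]


# Hypothetical workspace: three templates built on the 7/31 report,
# plus one table that reads from an unrelated file.
JULY = "reports/ohio_2020-07-31.pdf"
AUGUST = "reports/ohio_2020-08-31.pdf"
workspace = [
    TableSource("Header", JULY),
    TableSource("Equities", JULY),
    TableSource("Fixed Income", JULY),
    TableSource("Lookup", "reference/lookup.csv"),
]

updated = edit_all_file_paths(workspace, JULY, AUGUST)
for table in updated:
    print(table.table_name, "->", table.source_path)
```

    The point of the sketch is that the templates themselves do not change; only the file they point at does, which is why one action covers every table.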

  • Bob17_20299
    Bob17_20299 Altair Community Member
    edited September 2022

    Hi Baba,

    Your sample model works really well and I've learned a lot from it, though I think I've hit a wall.  I tried running the model on another PDF (see attached).  Although the PDF appears to have the same structure and format, it's actually slightly different (the margin is different) and so each line is shifted when viewed in the Report Editor. All the traps therefore fail. 

    I tried modifying the document by removing the 'left hand spaces', but that messes up the Append model, which pulls out the category headers (Equities, Equity Indices, Fixed Income, etc).

    Is there a way to make the model resistant to this kind of formatting variation in the PDF?

    Thank you for your time :)

  • Baba_Majekodunmi_703
    Baba_Majekodunmi_703
    Altair Employee
    edited October 2022

    Hi Bob,

    My apologies for the delay on this. I created a new version of the workspace that works with that second version. It's interesting how subtly different the PDF is.
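
    As a rough illustration of the margin problem outside Monarch (this is not how the attached workspace handles it), a shifted left margin can be neutralized by removing only the whitespace common to every line, so that the relative indentation separating category headers from detail rows survives. The Python sketch below assumes each page is available as plain text; the normalize_margin function and the sample rows are made up for the example.

```python
import textwrap


def normalize_margin(page_text: str) -> str:
    """Strip the leading whitespace shared by all lines.

    Relative indentation is preserved, so indented detail rows
    stay distinguishable from the category headers above them.
    """
    return textwrap.dedent(page_text)


# Made-up sample pages: same layout, different left margin.
first_pdf = (
    "    Equities\n"
    "      Stock A   100\n"
    "    Fixed Income\n"
    "      Bond B    200\n"
)
second_pdf = (
    "  Equities\n"
    "    Stock A   100\n"
    "  Fixed Income\n"
    "    Bond B    200\n"
)

print(normalize_margin(first_pdf))
print(normalize_margin(first_pdf) == normalize_margin(second_pdf))
```

    Removing all leading spaces from every line would instead flatten headers and rows onto the same column, which matches the problem described with the Append model.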

    Monarch can handle such a use case, but it would be ideal if the PDF/data source didn't vary much. For example, the first document had 'Ohio Capital Partners' in the header, and the second one had 'Ohio Capital' in the title.
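
    To show why this kind of variation breaks a position-based match, here is a small Python sketch that contrasts a fixed-column check with a pattern that tolerates both a shifted margin and the shorter title. It is an analogy only and does not use Monarch's trap syntax; fixed_trap, tolerant_trap, and the sample lines are hypothetical.

```python
import re


def fixed_trap(line: str, column: int, text: str) -> bool:
    """Match only if `text` starts at exactly `column` in the line."""
    return line[column:column + len(text)] == text


# Any amount of leading whitespace, and "Partners" is optional.
FUND_HEADER = re.compile(r"^\s*(Ohio Capital(?: Partners)?)\b")


def tolerant_trap(line: str):
    """Return the matched title, or None if the line is not a header."""
    match = FUND_HEADER.match(line)
    return match.group(1) if match else None


# Hypothetical header lines standing in for the two PDFs.
first_header = "    Ohio Capital Partners"
second_header = "  Ohio Capital"

print(fixed_trap(first_header, 4, "Ohio Capital Partners"))
print(fixed_trap(second_header, 4, "Ohio Capital Partners"))
print(tolerant_trap(first_header))
print(tolerant_trap(second_header))
```

    The trade-off is the usual one: the looser the pattern, the more layouts it accepts, and the more care is needed to make sure it does not also match lines that are not headers.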