Combining extraction methods for PDFs

Question

I'm using Data Prep Studio to parse the data in a PDF. The PDF contains multiple tables as well as some ad-hoc text in various places. The PDF table extractor works perfectly, but I'm unable to grab anything other than tables with this method. Is it possible to combine techniques? A sample file is attached. Any help would be appreciated. I'm new to Monarch and still evaluating the free trial. For example, how would I grab "Ohio Capital Partners" from the top of the page in addition to all the other tables? I'd also like to capture the date from the top (July 31, 2020) and the name of the fund (Ohio Capital Partners Onshore LP).

Ohio Capital 07-31-2020.pdf

Baba_Majekodunmi_703 · Answer

Hi Bob,

My apologies for the delay on this. I created a new version of the workspace that works with that second version of the PDF. It's very interesting how slightly different the two PDFs are.

Monarch can handle such a use case, but it would be ideal if the PDF/data source didn't vary much. For example, the first document had 'Ohio Capital Partners' in the header, and the second one had 'Ohio Capital' in the title.
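To illustrate the kind of variation described above outside of Monarch, here is a minimal Python sketch of a title match that tolerates both spellings. The pattern and the `find_title` helper are hypothetical names made up for this example, and the sample strings are assumptions; this is not how Monarch's templates work internally.

```python
import re

# Hypothetical pattern: "Ohio Capital", optionally followed by "Partners".
TITLE = re.compile(r"Ohio\s+Capital(?:\s+Partners)?", re.IGNORECASE)

def find_title(text):
    """Return the matched title text, or None if no title is found."""
    match = TITLE.search(text)
    return match.group(0) if match else None

# Assumed header/title lines, one per document variant.
first_doc = "Ohio Capital Partners Onshore LP"
second_doc = "Ohio Capital Investor Summary"

print(find_title(first_doc))
print(find_title(second_doc))
```

Making the variable part optional keeps one rule working across both documents, at the cost of being a little less strict about what counts as a title.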

Ohio Capital Partners - Monarch Files.zip

Bob17_20299 · Answer

Hi Baba,

Your sample model works really well and I've learned a lot from it, though I think I've hit a wall. I tried running the model on another PDF (see attached). Although the PDF appears to have the same structure and format, it's actually slightly different (the margin is different), so each line is shifted when viewed in the Report Editor. All the traps therefore fail.

I tried modifying the document by removing the 'left hand spaces', but that messes up the Append model, which pulls out the category headers (Equities, Equity Indices, Fixed Income, etc).
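To make the problem concrete outside of Monarch, here is a minimal Python sketch of why a position-based rule breaks when the margin shifts, and how removing only the common left margin (rather than all leading spaces) keeps the relative indentation. The report lines, column positions, and function names are all assumptions for illustration; whether this is why the Append model breaks is a guess, not something confirmed here.

```python
def left_margin(lines):
    """Smallest indentation among the non-blank lines."""
    indents = [len(l) - len(l.lstrip(" ")) for l in lines if l.strip()]
    return min(indents) if indents else 0

def normalize_margin(lines):
    """Strip the shared left margin but keep relative indentation."""
    margin = left_margin(lines)
    return [l[margin:] if l.strip() else "" for l in lines]

def fixed_trap(lines, col):
    """Position-based rule: a detail line starts exactly at column `col`."""
    return [l for l in lines
            if len(l) > col and l[:col].strip() == "" and l[col] != " "]

# Hypothetical report text: category headers at indent 2, details at indent 4.
report_a = [
    "  Equities",
    "    ACME Corp      1,000.00",
    "  Fixed Income",
    "    Bond Fund        250.50",
]
# Same report with the left margin shifted by five characters.
report_b = ["     " + line for line in report_a]

print(len(fixed_trap(report_a, col=4)))                    # rule tuned to report_a
print(len(fixed_trap(report_b, col=4)))                    # same rule on shifted text
print(len(fixed_trap(normalize_margin(report_b), col=2)))  # after normalizing
```

Because only the shared margin is removed, headers and detail lines remain distinguishable by their indentation after normalization.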

Is there a way to make the model resistant to this kind of formatting variation in the PDF?

Thank you for your time :)

Ohio Capital Investor Summary 2.pdf

Baba_Majekodunmi_703 · Answer

Hi Bob,

It’s actually very simple. Because we used templates instead of the table extractor, the process is repeatable. In other words, all you need to do is bring in the report for 8/31.

There’s a feature in Data Prep Studio called “Edit All File Paths”. You can see it when you right-click on any of the tables in Data Prep Studio. When you click on it and browse to the location of the 8/31 file, Monarch will replace every reference to the 7/31 file with the 8/31 file.

In short, bringing in the 8/31 file and having all these extraction methods applied should take seconds.

Let me know if you’re able to do it successfully. I will try to add a screenshot and a video next week too for reference.