Automated tabular PDF data extraction

Ashish D Altair Community Member
edited January 2021 in Community Q&A

Hi Team,

I am using the trial version of Monarch 2020 (v.16) to explore extracting tabular data from PDF pages based on a keyword search. We are able to get the tabular data with this version of Monarch, but I have to load each PDF and repeat the same manual steps to extract its tables. There is a huge number of PDF files as input, so doing the same thing again and again is tedious. Is there an automated way in the latest Monarch to extract tabular data from multiple PDF pages based on a keyword search, e.g. "ABC Statement"?

Please suggest and help.

Thanks & Regards,

Ashish Deshwal

Answers

  • CPorthouse
    Altair Employee
    edited January 2021

    How are you extracting the data from the PDFs? Are you using templates or Table Extractor? If the PDF files are similar and you want to repeat the same extraction on each one, building templates will work better for you than Table Extractor.

    Once your workspace/model has been built, you can bring in multiple PDF files at the same time (limited by your system resources).

    In either case, there is no automated way with Monarch alone. We do have a companion product called Monarch Server - Automator that allows you to automate your workspaces and models. However, you cannot use Automator with Table Extractor.

  • Ashish D
    Ashish D Altair Community Member
    edited January 2021

    Hi Chris,

    Thanks for replying.

    Yes, we are extracting tabular data using Table Extractor.

    Table Extractor works better for us because of its built-in functionality for data extraction and cleansing. But as you said, we cannot reuse it for repetitive work the way we can a model or template, so we need to look at the model approach.

    We actually have PDF files from different vendors, each in its own format, so we would need to create a model/template for each vendor. I guess that will help when new files arrive from the same vendor.
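
    A minimal sketch of the "one model per vendor" bookkeeping described above, done outside Monarch. It assumes the vendor can be recognised from a file-name prefix; the prefixes, model names, and the function `group_files_by_model` are all hypothetical and not part of any Monarch API.

```python
from collections import defaultdict

# Hypothetical mapping: file-name prefix -> name of that vendor's model.
VENDOR_MODELS = {
    "vendor_a": "vendor_a_model",
    "vendor_b": "vendor_b_model",
}


def group_files_by_model(filenames, vendor_models):
    """Group file names by the model of the vendor whose prefix they start with.

    Returns (groups, unmatched): a dict of model name -> file names, and a
    list of files that matched no known vendor prefix.
    """
    groups = defaultdict(list)
    unmatched = []
    for name in filenames:
        lowered = name.lower()  # match prefixes case-insensitively
        for prefix, model in vendor_models.items():
            if lowered.startswith(prefix):
                groups[model].append(name)
                break
        else:
            # No prefix matched: this vendor still needs a model.
            unmatched.append(name)
    return dict(groups), unmatched


# Hypothetical input files.
files = [
    "vendor_a_2021-01.pdf",
    "Vendor_B_invoice.pdf",
    "unknown.pdf",
    "vendor_a_2021-02.pdf",
]
groups, unmatched = group_files_by_model(files, VENDOR_MODELS)
print(groups)
print(unmatched)
```

    Files that land in `unmatched` come from a vendor without a model yet, so they would still need a new model/template before they can be processed.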

    One more thing: each PDF file has multiple pages, and we use keywords to find the relevant pages from which to extract tabular data. I am not sure whether that search can be incorporated into a model. Can you please advise?
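
    To make the keyword search concrete, here is a minimal sketch of the page-selection step on its own. It is not a Monarch feature: it assumes the text of each page has already been pulled out by some PDF text extractor, and the function names and sample page texts are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_pages_with_keyword(page_texts: list[str], keyword: str) -> list[int]:
    """Return the 1-based numbers of the pages whose text contains the keyword."""
    needle = normalize(keyword)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in normalize(text)
    ]


# Hypothetical page texts standing in for the output of a PDF text extractor.
pages = [
    "Cover page\nVendor X",
    "ABC   Statement\nItem  Qty  Amount",
    "Terms and conditions",
    "abc statement (continued)\nItem  Qty  Amount",
]
matching_pages = find_pages_with_keyword(pages, "ABC Statement")
print(matching_pages)  # [2, 4]
```

    Normalising case and whitespace first means a heading such as "ABC   Statement" still matches, which helps when extracted PDF text has irregular spacing.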

    Also, the tabular data on these pages does not have a simple format, so building a model is quite challenging. Could you kindly share a video tutorial or link where a model is built on PDF tabular data?

    Best Regards,
    Ashish Deshwal

  • Baba_Majekodunmi_703
    Altair Employee
    edited January 2021

    Hi Ashish,

    We haven't forgotten about your request here. We're going to share a link soon to a library of demonstrations and training, along with information on how to get certified in Monarch. Stay tuned, and thanks for your patience.

    Best Regards,

    Baba