Using Altair Monarch to extract info from a user's guide

EdT_21448 New Altair Community Member
edited September 2022 in Community Q&A

Hi,

I tried to use Monarch on a user's guide, such as the one attached, but the extracted information is too fragmented.

My question is:

Is it possible to extract information from a document like this?

My reason for asking:

I need to collect information from dozens of manuals and put it in a database. At least 50% of each document contains data that can be extracted.

Answers

  • Clinton Chee_22243
    New Altair Community Member
    edited January 4

    Hi There,

    When you say you "tried to use Monarch", I assume you were using either Monarch DataPrep Studio or the PDF Extractor in Monarch Classic. It is still possible to extract information, but as you say, "the information is too fragmented", because different sections of the manual have their own format and styling.

    If you have not tried it yet, one tool that may be of interest is the "PDF Table Extractor". To open it in Monarch DataPrep Studio, click Open Data -> PDF & Text -> PDF Table Extractor.

    Some videos that may be helpful for new users of PDF Table Extractor:

    https://www.youtube.com/watch?v=f1HgpyG3Dts

    https://www.youtube.com/watch?v=7j8FZ72hENg

    Basically, inside PDF Table Extractor, once you have chosen the PDF, click the Auto-Define button and select All Pages. After a few seconds it will try its best to capture all the tables in the PDF. You can then review the captured tables and adjust their borders by dragging them. Even with those manual adjustments, this may still save over 80% of the time compared with manually copying and pasting into a spreadsheet or database.

    Thanks

  • CPorthouse
    Altair Employee
    edited August 2022

    It all depends on exactly what data you are looking to extract. Clinton mentioned the table extractor, and that could be a good start. You may also need to look at floating traps or regex traps, and the data may need a lot of clean-up once you pull it in.

    One of the hurdles with the PDF file provided is that there are images placed on the page; Monarch removes those and tries to place the text as you would see it. I also tried saving the PDF as a text file and noticed that the text is not much better:

    Screenshot of PDF:

    image

    Copied from PDF:

    SEVERE waRning SignS foR inStaLLation

    Text editor version:

    SEVERE waRning SignS foR inStaLLation
    WARNING

    Monarch version:

    WARNING SEVERE waRning SignS foR inStaLLation

    Basically, within Monarch, you are seeing the actual representation of the underlying text without all of the PDF formatting.
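
To illustrate the kind of clean-up mentioned above, here is a minimal Python sketch that works on a line like the "Monarch version" sample once it has been exported as text. It is not Monarch's trap syntax and does not use any Monarch API; it only mimics the idea of a regex trap by anchoring on a known label. The `SIGNAL_WORDS` list and the `clean_heading` function are hypothetical names chosen for this example (the thread itself only shows "WARNING"), and folding the heading to upper case is an assumption about how the small-caps text should be stored.

```python
import re

# Hypothetical signal words; only "WARNING" appears in the sample above.
SIGNAL_WORDS = ("WARNING", "CAUTION", "DANGER")

# A line that starts with a signal word followed by heading text.
# This plays the role a regex trap would: it anchors on a known label.
LABELLED_HEADING = re.compile(
    r"^(?P<label>" + "|".join(SIGNAL_WORDS) + r")\s+(?P<heading>.+)$"
)


def clean_heading(line):
    """Return (label, heading) for a labelled heading line, else None."""
    match = LABELLED_HEADING.match(line.strip())
    if match is None:
        return None
    # Small caps are extracted as mixed case, so fold the heading to upper case.
    return match.group("label"), match.group("heading").upper()


print(clean_heading("WARNING SEVERE waRning SignS foR inStaLLation"))
# prints ('WARNING', 'SEVERE WARNING SIGNS FOR INSTALLATION')
```

Lines that do not start with one of the listed labels return `None`, so they can be left for a different rule or reviewed by hand.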