Using Altair Monarch to extract info from a user's guide

Question

Hi,

I tried to use Monarch on a user's guide manual, such as the one attached, but the extracted information is too fragmented.

My question is:

Is it possible to extract information from a document like this?

My reason for asking:

I need to collect information from dozens of manuals and put it in a database. At least 50% of each document contains data that can be extracted.

samsung-appliance-rs261mdbp-use-and-care-manual.pdf

CPorthouse · Answer

It all depends on exactly what data you are looking to extract. Clinton mentioned the PDF Table Extractor, and that could be a good start. You may also need to look at floating traps or regex traps. The data may require a lot of cleanup once you pull it in.
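As a rough illustration of what a regex trap does, here is a minimal sketch in plain Python. It is not Monarch's trap syntax, and the sample lines and the "label: number unit" layout are made up for the example. The idea is only that a pattern picks out the lines that look like a record and skips everything else:

```python
import re

# Hypothetical lines standing in for text pulled from a manual.
SAMPLE_LINES = [
    "WARNING SEVERE waRning SignS foR inStaLLation",
    "Width: 35.75 in",
    "Some free-flowing paragraph text that is not a record.",
    "Height: 70 in",
]

# Assumed record layout: "<label>: <number> <unit>" on a single line.
TRAP = re.compile(
    r"^(?P<label>[A-Za-z ]+):\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)$"
)


def trap_records(lines):
    """Return one dict per line that matches the pattern; skip the rest."""
    records = []
    for line in lines:
        match = TRAP.match(line.strip())
        if match:
            records.append(
                {
                    "label": match.group("label").strip(),
                    "value": float(match.group("value")),
                    "unit": match.group("unit"),
                }
            )
    return records
```

Calling trap_records(SAMPLE_LINES) keeps only the two lines that fit the pattern. In a real manual, each kind of record would most likely need its own pattern.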

One of the hurdles with the PDF file provided is that there are images placed on the page; Monarch removes those and tries to place the text as you would see it. I also tried saving the PDF as a text file and noticed the text is not much better:

Screenshot of PDF:

Copied from PDF:

SEVERE waRning SignS foR inStaLLation

Text editor version:

SEVERE waRning SignS foR inStaLLation 
WARNING

Monarch version:

WARNING SEVERE waRning SignS foR inStaLLation

Basically, within Monarch, you are seeing the actual representation of the underlying text without all of the PDF formatting.
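If the mixed-case headings are a problem downstream, one possible cleanup step outside Monarch is to normalize them after extraction. The sketch below is plain Python and rests on an assumption: any line containing a word with an uppercase letter directly after a lowercase one (such as "waRning") is a small-caps heading that should be all uppercase. The function name is hypothetical.

```python
import re

# Matches a word with an uppercase letter right after a lowercase one,
# e.g. "waRning" or "inStaLLation".
ERRATIC_CASE = re.compile(r"\b\w*[a-z][A-Z]\w*\b")


def normalize_heading(line):
    """Uppercase the whole line if any word has erratic casing; else return it unchanged."""
    if ERRATIC_CASE.search(line):
        return line.upper()
    return line
```

For example, normalize_heading("SEVERE waRning SignS foR inStaLLation") returns "SEVERE WARNING SIGNS FOR INSTALLATION". Note that legitimate camel-case names (product names, for instance) would also trigger the rule, so it would need an exception list in practice.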

Clinton Chee_22243 · Answer

Hi There,

When you say you "tried to use Monarch ....", I assume you are using either Monarch DataPrep Studio or Monarch Classic - PDF Extractor? It is of course still possible to extract information, but as you say, "the information is too fragmented" because different sections have their own format and styling.

Assuming you have not tried it, one thing that may be of interest is a special tool called "PDF Table Extractor". To access it from Monarch DataPrep Studio, click Open Data -> PDF & Text -> PDF Table Extractor.

Some videos that may be helpful for new users of PDF Table Extractor:

https://www.youtube.com/watch?v=f1HgpyG3Dts

https://www.youtube.com/watch?v=7j8FZ72hENg

Basically, inside PDF Table Extractor, once you have chosen the PDF, click the Auto-Define button and select All Pages. Wait a few seconds and it will try its best to capture all tables in the PDF. After that, you can review and adjust the table capture borders. Although you may need to adjust the tables (by dragging the borders), this may still save over 80% of the time compared to manually copying and pasting into a spreadsheet or database.
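Since the end goal is a database, here is a minimal sketch of the last step. It assumes the captured tables have been exported to CSV files with a header row; that export step is an assumption about the workflow. SQLite stands in for whatever database is actually used, and the function, file, and table names are hypothetical.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table_name):
    """Create table_name from the CSV header and insert each data row as text.

    Rows whose column count differs from the header are skipped.
    Returns the number of rows inserted.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for row in reader if len(row) == len(header)]

    # Quote identifiers so headers containing spaces still work as column names.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"{}"'.format(table_name.replace('"', '""'))

    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS {} ({})".format(quoted_table, columns)
            )
            connection.executemany(
                "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders), rows
            )
    finally:
        connection.close()
    return len(rows)
```

A call such as load_csv_into_sqlite("manual_tables.csv", "manuals.db", "specs") would then be repeated per exported file. Every column is stored as text here, so converting to proper types is left as a later cleanup step.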

Thanks