Extracting PDF Images

Al_22614 Altair Community Member
edited March 2023 in Community Q&A
Hello All:

I am working with a large volume of bank statements containing check images, which I need to extract and categorize for subsequent retrieval. Has anyone had success extracting images from PDF files in this way?

As this will be a recurring requirement, I would like to automate the process using Excel VBA, VBScript or Java/JavaScript.

Any suggestions would be appreciated.

Thanks...

------------------------------
Al Rice
------------------------------

Answers

  • Steve_Caiels
    Altair Employee
    edited April 2020
    Hi Al,

    I assume these are image based PDF files, so you just get a blank window when you open them in Monarch.  Is that correct?

    If so, you will need to run them through an OCR tool of your choice.  These tools will attempt to convert the image into searchable text that should then be available within Monarch.

    The success will depend largely on the quality of the image in the PDF.  If they are sharp, straight and of high enough resolution, you should get excellent results with the machine printed data.  But if they are fuzzy or scanned at a slight angle, the success rate diminishes rapidly.

    Handwritten amounts and dates will require ICR instead of OCR.  Tools that offer this tend to be more expensive and will almost certainly have much lower accuracy. Even then, they require block-written characters.  I don't believe there are any ICR tools that will reliably convert handwritten cursive text.

    Please be aware that the accuracy of the extracted characters is beyond our control.  Whatever you see in the report window of Monarch will be what the OCR/ICR tool has generated.  There is no interpretation going on within Monarch.

    If you are looking to use the final solution within Automator, then you need to choose a tool that has an API, such as ABBYY FineReader.  Simple command-line tools may not work in Automator if they call a GUI, even if that GUI does not require any interaction.

    Regards,
    Steve.

    ------------------------------
    Steve Caiels
    Professional Services
    Altair
    ------------------------------
  • Al_22614
    Altair Community Member
    edited April 2020

    Thanks, Steve.

    Sorry for my slow response.  Your reply was helpful.  I was hoping for a low-impact solution that could search PDF documents and extract any images to .jpg (or similar) files, each named with a reference to the PDF file name and page number.  Since Monarch does not have this capability, I will look for an effective third-party tool that can be called from an automation script.

    Thanks again. I am always impressed with the responsiveness in this forum.

    Stay safe and don't forget to wash your hands :-)

    ------------------------------
    Al Rice
    ------------------------------
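
One way to get the behaviour Al describes from a script is a command-line extractor. The following is a minimal sketch, assuming the open-source Poppler `pdfimages` utility is installed and on the PATH. That tool is not mentioned anywhere in this thread and is not part of Monarch, and the helper names `build_pdfimages_command` and `extract_folder` are hypothetical. The PDF's own name is used as the output prefix, and the `-p` option asks `pdfimages` to add the page number to each output file name.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path: Path, out_dir: Path) -> list[str]:
    """Build a pdfimages call whose output names start with the PDF's name."""
    prefix = out_dir / pdf_path.stem
    # -j : write JPEG-encoded (DCT) images as .jpg files
    # -p : include the page number in each output file name
    return ["pdfimages", "-j", "-p", str(pdf_path), str(prefix)]


def extract_folder(pdf_dir: Path, out_dir: Path) -> None:
    """Run pdfimages once per PDF in pdf_dir (assumes Poppler is installed)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if pdfimages reports a failure
        subprocess.run(build_pdfimages_command(pdf_path, out_dir), check=True)
```

A call such as `extract_folder(Path("statements"), Path("check_images"))` (both folder names are placeholders) would process every PDF in one folder. Images that are not stored as JPEG inside the PDF are written in another format by `pdfimages`, so a conversion step may still be needed if .jpg is a hard requirement.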
  • Richard Bret
    New Altair Community Member
    edited March 2023

    If anyone still needs help, I can add my two pence here. It's definitely possible to extract images from PDF files using automation tools like Excel VBA or JavaScript. One option you might consider is using an OCR engine like Smart Engines to detect and extract the check images from the PDFs. From there, you could use VBA or JavaScript to sort and categorize the images based on your needs.
    Another approach could be to use Python with the PyPDF2 library to extract the images from the PDFs. This would require some programming knowledge, but there are plenty of resources available online to help you get started.
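
The Python route suggested above can be sketched roughly as follows. This is a minimal sketch rather than a tested solution for bank statements. It assumes the third-party `pypdf` package (the maintained successor to PyPDF2), and the helper names `image_file_name` and `extract_images` are hypothetical. Each image is saved under the PDF's name plus a page number, which is the naming Al asked for earlier in the thread.

```python
from pathlib import Path


def image_file_name(pdf_path: Path, page_number: int, image_index: int, suffix: str) -> str:
    """Name an image after its PDF, 1-based page number and position on the page."""
    return f"{pdf_path.stem}_p{page_number:03d}_{image_index:02d}{suffix}"


def extract_images(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Save every embedded image in pdf_path to out_dir and return the new paths."""
    from pypdf import PdfReader  # third-party dependency: pip install pypdf

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    reader = PdfReader(str(pdf_path))
    for page_number, page in enumerate(reader.pages, start=1):
        for image_index, image in enumerate(page.images, start=1):
            # Reuse the extension pypdf assigns to the image; ".bin" is a fallback
            suffix = Path(image.name).suffix or ".bin"
            target = out_dir / image_file_name(pdf_path, page_number, image_index, suffix)
            target.write_bytes(image.data)
            written.append(target)
    return written
```

With this naming, the second image on page 4 of `stmt_2020_03.pdf` would be saved as `stmt_2020_03_p004_02.jpg` if it is stored as a JPEG. Note that if each statement page is a single full-page scan, this yields whole-page images rather than individual checks, so cropping or detection would still be a separate step.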