[SOLVED] Read PDF with images for Text Mining

johan_CG
johan_CG New Altair Community Member
edited November 5 in Community Q&A
Hi everybody

I'm new to RapidMiner and I need to build a process that counts the words in a folder containing thousands of files of different types, including PDF.
I built a first process for HTML files only, which works as I want, but I have a problem with some PDF files.

Specifically, a PDF that contains at least one image cannot be read, whereas other PDFs cause no problem.

I work on Windows 7 64-bit with RapidMiner 5.3.0 64-bit (all extensions installed) and 64-bit Java.
Currently I use only the "Read Document" operator. When I run the process, I get the following pop-up message:

Process Failed
The setup does not seem to contain any obvious errors, but you should check the log messages or activate the debug mode in the settings dialog in order to get more information about this problem.
And the following log:

Feb 27, 2013 2:12:24 PM INFO: No filename given for result file, using stdout for logging results!
Feb 27, 2013 2:12:24 PM INFO: Process starts
Feb 27, 2013 2:12:24 PM INFO: Loading initial data.
Feb 27, 2013 2:12:24 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 27, 2013 2:12:24 PM SEVERE: Here:           Process[1] (Process)
          subprocess 'Main Process'
      ==>   +- Read Document[1] (Read Document)
Feb 27, 2013 2:12:24 PM SEVERE: java.lang.NullPointerException
Can someone help me mine the text in a PDF file that contains images?

Thanks in advance for your replies
Johan
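
To make the goal above concrete, here is a minimal sketch in plain Java (outside RapidMiner) of counting words across a folder. It assumes the text has already been extracted into `.txt` files; the class name `FolderWordCount`, the tokenization rule, and the `.txt` convention are illustrative assumptions, not part of the original process.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: counts word frequencies over all .txt files in a folder.
class FolderWordCount {

    // Adds the word frequencies of one text to the running totals.
    // Assumption: a "word" is a run of letters or digits, compared case-insensitively.
    static void countWords(String text, Map<String, Integer> totals) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                totals.merge(token, 1, Integer::sum);
            }
        }
    }

    // Walks the folder recursively and accumulates counts from every .txt file.
    static Map<String, Integer> countFolder(Path folder) throws IOException {
        List<Path> textFiles;
        try (Stream<Path> paths = Files.walk(folder)) {
            textFiles = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".txt"))
                    .collect(Collectors.toList());
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Path file : textFiles) {
            // Decoding the bytes as UTF-8 is an assumption about the extracted text.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            countWords(text, totals);
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FolderWordCount <folder>");
            return;
        }
        countFolder(Paths.get(args[0]))
                .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Reading each file separately means that a single unreadable file could be caught and skipped instead of stopping the whole count.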

Answers

  • Hello Johan

    I encountered a similar problem and worked round it using an external tool called pdfbox.

    I wrote about it here http://rapidminernotes.blogspot.com/2012/07/converting-pdf-to-text.html

    Fairly complicated but I hope it helps.

    regards

    Andrew
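
One possible way to use PDFBox as an external tool is its command-line application. The sketch below only builds the command for the `ExtractText` tool shipped in the PDFBox 1.x/2.x `pdfbox-app` jar (newer 3.x releases use a different command syntax). The jar location, the class name `PdfBoxExtractCommand`, and the convention of writing a `.txt` file next to the PDF are assumptions for illustration, and this is not necessarily the approach taken in the linked blog post.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds the command line that runs PDFBox's ExtractText on one PDF.
class PdfBoxExtractCommand {

    // Assumption: where the downloaded pdfbox-app jar was saved (name and folder are made up).
    static final Path PDFBOX_APP_JAR = Paths.get("lib", "pdfbox-app.jar");

    // Maps "report.pdf" to "report.txt" in the same folder.
    static Path textFileFor(Path pdf) {
        String name = pdf.getFileName().toString();
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        return pdf.resolveSibling(base + ".txt");
    }

    // Equivalent to: java -jar <jar> ExtractText <input.pdf> <output.txt>
    static List<String> buildCommand(Path jar, Path pdf) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jar.toString());
        command.add("ExtractText");
        command.add(pdf.toString());
        command.add(textFileFor(pdf).toString());
        return command;
    }

    public static void main(String[] args) {
        Path pdf = Paths.get(args.length > 0 ? args[0] : "example.pdf");
        List<String> command = buildCommand(PDFBOX_APP_JAR, pdf);
        // Print the command; to execute it, pass the list to
        // new ProcessBuilder(command).inheritIO().start() and wait for the exit code.
        System.out.println(String.join(" ", command));
    }
}
```

The resulting text files can then be read by the text mining process in place of the original PDFs.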
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    good news! We actually use Pdfbox to read PDF files; however, the version in use was kinda outdated :o
    If it doesn't break anything, the next Text Processing Extension release should contain an up-to-date version of said library, and your PDF files should then be read without problems.

    Regards,
    Marco
  • johan_CG
    johan_CG New Altair Community Member
    Hi,

    Thank you for your quick replies.

    @Andrew: I tried to follow your note, but I can't find out how to install Pdfbox.
      I found this page http://rapid-i.com/wiki/index.php?title=RapidMiner_Installation_Guide which says:

    Installing Plugins
    If you want to use one of the RapidMiner plugins, just download the plugin Jar file and copy it into the subdirectory lib/plugins in your RapidMiner program directory. Windows users can also simply use the provided installer (.exe) files.
    But it doesn't work.

    @Marco: This is indeed very good news!  :D
      Unfortunately, I have another problem to solve: I can't update RapidMiner.
      When I click on "Help" > "Updates and Extensions (Marketplace)..." > "Updates", I see RapidMiner 5.3.5. I then click on "Install 1 packages" and follow the steps.
      RapidMiner then restarts, but the new version isn't installed and it is still version 5.3.0.
      Do you have any idea how to solve this?

    Thanks in advance
    Johan
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    what OS are you using and how do you start RapidMiner? RapidMiner needs to be started with a script from the scripts folder or the .exe file (but not via the RapidMiner.jar itself), and it may need admin privileges depending on where you installed it. If all else fails, you can always download the latest RM release .zip file and extract & overwrite everything in your RapidMiner directory, though that is only a last resort and should not be necessary.

    Regards,
    Marco
  • johan_CG
    johan_CG New Altair Community Member
    Hi Marco,

    Thank you for your advice on solving the problem.  :)
    I'm using Windows 7 and the .exe file to start RapidMiner. I also have admin privileges.
    I found a way to fix the update: I downloaded the latest installer from the Rapid-I website, and when I ran it, it automatically updated my version.

    Do you have a date for the next Text Processing Extension release?  ???

    Thanks in advance
    Johan
  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    unfortunately I do not have a date.
    However if you are familiar with Eclipse or another Java IDE of your choice, you can download the updated sources (tomorrow or the day after) and build the extension yourself if you don't want to wait.
    SVN location: http://svn.code.sf.net/p/rapidminer/code/Plugins/TextProcessing/Unuk/

    Regards,
    Marco
  • johan_CG
    johan_CG New Altair Community Member
    Hi,

    Thank you Marco.
    For the moment I will wait, but if I build the extension myself I will give you feedback.

    Regards
    Johan
  • johan_CG
    johan_CG New Altair Community Member
    Hi Marco

    In the end I used the sources because I couldn't wait for the update.
    It works very well, and I also added an operator which I think may be very useful, judging by other posts in the forum. :P

    Thank you very much to all of you who work on RapidMiner and its extensions, and to those who helped me  ;)
    Regards
    Johan