[SOLVED] Read PDF with images for Text Mining
johan_CG
New Altair Community Member
Hi everybody
I'm new to RapidMiner and I need to build a process that counts the words in a folder containing thousands of files of different types, including PDF.
I built a first process for HTML files only, which works as I want, but I have a problem with some PDF files.
Whenever a PDF contains at least one image, it cannot be read, whereas other PDFs cause no problem.
I work on Windows 7 64-bit, with RapidMiner 5.3.0 64-bit (all extensions installed) and 64-bit Java.
Currently I use only the "Read Document" operator. When I run the process I get the following pop-up message:
Process Failed
The setup does not seem to contain any obvious errors, but you should check the log messages or activate the debug mode in the settings dialog in order to get more information about this problem.
And the following log:
Feb 27, 2013 2:12:24 PM INFO: No filename given for result file, using stdout for logging results!
Feb 27, 2013 2:12:24 PM INFO: Process starts
Feb 27, 2013 2:12:24 PM INFO: Loading initial data.
Feb 27, 2013 2:12:24 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 27, 2013 2:12:24 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
==> +- Read Document[1] (Read Document)
Feb 27, 2013 2:12:24 PM SEVERE: java.lang.NullPointerException
Can someone help me mine the text in a PDF file that has images inside?
Thanks in advance for your replies
Johan
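For readers who want to check the expected word counts outside RapidMiner, here is a minimal Python sketch of the counting step. It assumes the documents have already been converted to plain-text files; the function names, the `*.txt` pattern, and the simple ASCII tokenizer are illustrative choices, not anything taken from RapidMiner.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a word is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def count_words(text):
    """Return a Counter of lower-cased words found in `text`."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


def count_words_in_folder(folder, pattern="*.txt"):
    """Sum the word counts of every file under `folder` matching `pattern`."""
    total = Counter()
    for path in sorted(Path(folder).rglob(pattern)):
        # Undecodable bytes are replaced rather than aborting the whole run.
        text = path.read_text(encoding="utf-8", errors="replace")
        total.update(count_words(text))
    return total
```

Calling `count_words_in_folder("some_folder").most_common(20)` would list the twenty most frequent words, which can serve as a sanity check against the RapidMiner result.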
Answers
Hello Johan
I encountered a similar problem and worked around it using an external tool called PDFBox.
I wrote about it here: http://rapidminernotes.blogspot.com/2012/07/converting-pdf-to-text.html
Fairly complicated but I hope it helps.
regards
Andrew
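The exact steps in the linked note are not reproduced in this thread. As a rough sketch of the same idea, the Python snippet below converts every PDF in a folder to text by calling the standalone PDFBox command-line application. It assumes Java is on the PATH and that the PDFBox "app" jar has been downloaded; the jar file name and the helper function names are hypothetical, and the `ExtractText` syntax shown is the one used by PDFBox 1.x and 2.x, so check the documentation of the version you download.

```python
import subprocess
from pathlib import Path

# Assumption: location of the downloaded PDFBox command-line jar (hypothetical name).
PDFBOX_JAR = Path("pdfbox-app.jar")


def extract_text_command(pdf_path, txt_path, jar=PDFBOX_JAR):
    """Build the command line for PDFBox's ExtractText tool (1.x/2.x syntax)."""
    return ["java", "-jar", str(jar), "ExtractText", str(pdf_path), str(txt_path)]


def convert_folder(folder, jar=PDFBOX_JAR):
    """Convert each *.pdf directly inside `folder` to a .txt file next to it.

    Returns the list of PDF paths whose conversion exited with an error.
    """
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = extract_text_command(pdf, pdf.with_suffix(".txt"), jar)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

The resulting .txt files can then be read by RapidMiner like any other plain-text document, which sidesteps the PDF reader inside the extension.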
Hi,
Good news! We actually use PDFBox to read PDF files; however, the bundled version was kinda outdated.
If it doesn't break anything, the next Text Processing Extension release should contain an up-to-date version of the library, and your PDF files should then be read without problems.
Regards,
Marco
Hi,
Thank you for your quick replies.
@Andrew: I tried to follow your note, but I can't find how to install PDFBox.
I found this page http://rapid-i.com/wiki/index.php?title=RapidMiner_Installation_Guide which says:
Installing Plugins
If you want to use one of the RapidMiner plugins, just download the plugin Jar file and copy it into the subdirectory lib/plugins in your RapidMiner program directory. Windows users can also simply use the provided installer (.exe) files.
But it doesn't work.
@Marco: That is very good news indeed!
Unfortunately I have another problem to solve: I can't update my RapidMiner.
When I click on "Help" > "Updates and Extensions (Marketplace)..." > "Updates", I see RapidMiner 5.3.5; then I click "Install 1 packages" and follow the steps.
RapidMiner then restarts, but the new version isn't installed and it is still version 5.3.0.
Do you have any idea how to solve this?
Thanks in advance
Johan
Hi,
What OS are you using, and how do you start RapidMiner? RapidMiner needs to be started with a script from the scripts folder or via the .exe file (not via RapidMiner.jar itself), and it may need admin privileges depending on where you installed it. If all else fails, you can always download the latest RapidMiner release .zip file and extract it over your RapidMiner directory, overwriting everything, though that is only a last resort and should not be necessary.
Regards,
Marco
Hi Marco,
Thank you for your advice on solving the problem.
I'm using Windows 7 and I start RapidMiner with the .exe file. I also have admin privileges.
I found a way to get the update: I downloaded the latest installer from the Rapid-I website, and when I ran it, it automatically updated my version.
Do you have a date for the next Text Processing Extension release?
Thanks in advance
Johan
Hi,
Unfortunately I do not have a date.
However, if you are familiar with Eclipse or another Java IDE, you can download the updated sources (tomorrow or the day after) and build the extension yourself if you don't want to wait.
SVN location: http://svn.code.sf.net/p/rapidminer/code/Plugins/TextProcessing/Unuk/
Regards,
Marco
Hi,
Thank you Marco.
For the moment I will wait, but if I build the extension myself I will give you feedback.
Regards
Johan
Hi Marco
In the end I used the sources because I couldn't wait for the update.
It works very well, and I also added an operator which I think may be very useful, judging by other posts in the forum. :P
Thank you very much to everyone who works on RapidMiner and its extensions, and to those who helped me.
Regards
Johan