Loading multiple pdf files
sdspieg
New Altair Community Member
I am trying to load a corpus of several pdf files into RM. I selected 'Process Documents from Files' from the text processing menu and selected the directory with the pdf files. But when I run this process, it gives me the following error message:
Feb 28, 2012 2:30:04 AM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 28, 2012 2:30:04 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
==> +- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
Feb 28, 2012 2:30:04 AM SEVERE: java.lang.ArrayIndexOutOfBoundsException
Can you please help?
-Stephan
Answers
Hi Stephan,
the information you provide is a bit sparse. Can you please post your process setup? Did you try another folder with another set of pdf files? You can find some useful hints on what to include in your questions here.
Kind regards,
Marius
My apologies. I'll try to explain in some more detail. We are trying (for an EU FP7 project) to run a corpus of about 1000 mostly English pdf files (academic articles downloaded from EBSCO and categorized by academic discipline) through RapidMiner's text mining engine. We are interested in a number of different results:
* the key concepts that emerge from various subsets of this corpus (which are in separate - labeled - subfolders)
* the various n-grams that contain certain words in them (e.g. every n-gram with the word 'security' in it): which combinations occur most frequently in the text
* co-occurrences of various words within certain 'windows' (say, 2 sentences) throughout the text.
* automatic clustering of all pdfs
* ...
All of this after having run the usual text mining processes of course (tokenization, stemming, etc.). But it seems to me that with the available information, we should be able to set up that entire process. All the help I am asking for is the very first step: to get the pdfs into RapidMiner.
I am trying to follow the Vancouver Data video on 'loading text into Rapidminer'. As explained there, I click on 'Process Documents from Files' in the 'Text Processing' operators section. At that point my screen already looks different from the video: the 'exa' and 'wor' handles are automatically connected to two 'res' handles on the right. I still click on 'Text Directories', and I input the folder where I have the first set of pdfs. I accept the suggested
Mar 11, 2012 3:38:04 AM INFO: Loading initial data.
Mar 11, 2012 3:38:05 AM SEVERE: Process failed: operator cannot be executed (4). Check the log messages...
Mar 11, 2012 3:38:05 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
==> +- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
Mar 11, 2012 3:38:05 AM SEVERE: java.lang.ArrayIndexOutOfBoundsException: 4
Thanks for any help.
-Stephan
Hi Stephan,
thanks for your detailed description. Unfortunately I still can't tell where the error comes from.
Which RapidMiner version do you use? If it is not the latest version (5.2.002), please update.
If the error still occurs, can you please post your process setup?
Did you try another folder with different pdf files as input to check if it is caused by a corrupted pdf file?
Best, Marius
I haven't even built a process yet! Shouldn't I just be able to LOAD the corpus first (as on the video) and THEN set up the process?
As soon as you drag the Process Documents from Files operator onto the process view, you have set up a process - admittedly a very simple one, but nevertheless a process. Am I right that you get the error right after you click the blue play button? With that button you actually execute the process. If that is true, please post your process setup as described in this thread: http://rapid-i.com/rapidforum/index.php/topic,4654.0.html
I haven't seen the video yet, so I can't tell you whether anything is different from it.
Best,
Marius
When I enter the very same files as regular text files (batch-converted from those pdf-files), for instance, everything works perfectly. But not with pdf-files.
Then it's probably a problem with a specific, probably corrupted, pdf file. I can't reproduce your problem with any pdf file on my computer. Maybe you can try your process on a subset of the pdfs in that folder, or try another folder, to see if the problem occurs with all pdfs or only with some specific ones.
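Before splitting the folder by hand, a quick pre-check outside RapidMiner can flag the most obviously damaged files. The following is only a minimal sketch in Python, not part of RapidMiner: the function names `looks_like_pdf` and `find_suspect_pdfs` are made up for illustration, and the check is a heuristic that assumes a healthy pdf has a `%PDF-` header near the start and an `%%EOF` marker near the end.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Heuristic check: '%PDF-' in the first 1024 bytes and '%%EOF' in the last 1024."""
    data = Path(path).read_bytes()
    return b"%PDF-" in data[:1024] and b"%%EOF" in data[-1024:]


def find_suspect_pdfs(folder):
    """Return the sorted names of *.pdf files in `folder` that fail the heuristic."""
    suspects = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            ok = looks_like_pdf(pdf)
        except OSError:
            # Unreadable files are reported as suspects as well.
            ok = False
        if not ok:
            suspects.append(pdf.name)
    return suspects
```

Calling `find_suspect_pdfs` on the folder you entered under 'Text Directories' lists files that are truncated or are not really pdfs. A file that passes can still trip up the text extraction, so running the process on subsets remains the more reliable test.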
Best,
Marius