"MySQL, PDFs and text mining"

Question

Yep,

I know the title is a bit weird, but here is where it comes from. I have a rather large repository of documents (mostly PDFs, some HTML and some TXT files) stored in a database: in each record of my table, a single field contains a whole PDF, alongside fields describing the doctype, language, origin and a label. The files are kept in a database rather than on a filesystem because we want to give outside users access to them through some simple PHP files.
In the end, this database will also contain some properties of these PDFs gathered with RM.

However, we also want to explore all the PDFs and see which are 'familiar' and which are not (using RM and the WVTool, of course).
The problem now is this: 
1/ Accessing a MySQL database from within RM is a piece of cake (thank you, RM developers!), and so is specifying which field to pick.
2/ Using the WVTool on a directory of PDFs is easy as well.

But is there a way to forward the MySQL stream (which actually contains the PDFs) to the TextInput method of the WVTool, so that the WVTool treats these fields of the different records as files? Or do I need to use a method other than TextInput for this task?

So, in summary: how can I replace the directory or URL input option with a kind of ExampleSet input option?
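One possible workaround, not confirmed anywhere in this thread, is to dump the blob fields into a scratch directory and point the directory-based input at it. The sketch below is only an illustration under stated assumptions: `export_blobs` is a hypothetical helper, and each row is assumed to be a `(label, doctype, blob)` tuple, for example as returned by a database cursor over a table with such columns (the column names are assumptions as well). It does not answer whether RM can pass an ExampleSet to the WVTool directly.

```python
import os


def export_blobs(rows, target_dir):
    """Write each (label, doctype, blob) row to its own file in target_dir.

    Hypothetical helper: the row layout is an assumption, not the real schema.
    Returns the list of file paths that were written.
    """
    paths = []
    for index, (label, doctype, blob) in enumerate(rows):
        # Keep only filename-safe characters from the label.
        safe_label = "".join(
            c if c.isalnum() or c in "-_" else "_" for c in str(label)
        )
        # Derive the extension from the doctype field, e.g. "PDF" -> "pdf".
        extension = "".join(c for c in str(doctype).lower() if c.isalnum()) or "bin"
        path = os.path.join(target_dir, f"{index:05d}_{safe_label}.{extension}")
        # Blobs are binary, so write bytes rather than text.
        with open(path, "wb") as handle:
            handle.write(bytes(blob))
        paths.append(path)
    return paths
```

The database stays the single source of truth in this approach; the directory is just a throwaway copy that a file-based reader can consume.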

BTW, I'm working on a Linux FC8 system, in case that helps to solve or circumvent this problem.
All help is greatly appreciated.

(A possible solution would be to start from PDFs on a filesystem and load them, together with the gathered data, into the MySQL database using RM. I want to avoid this, because we are working on a project with different teams: some provide the data (PDFs, TXTs, ...), I do the text mining, and others will use the outcome. Hence a web-accessible database is preferable to a filesystem.)

Best, 
Patrick

Legacy User · Answer

Hi Patrick,

Thank you for the tip, it is indeed working now!
Rocky.

pdemaziere · Answer

Hi Rocky,

Upgrading to Java version 6 might solve your problems.

Patrick

Hello Tobias,

I still can't get the damned thing to work. I guess I made some crucial mistake somewhere or did not explain something well. In my database the whole PDF is loaded as a blob, so the content of this database field is identical to what you get with "more filename.pdf" rather than "acroread filename.pdf". Compared with the outcome of reading the directory of PDFs, I still do not get the same results. RM still treats the file content as one "variable", which makes it look as if the string tokenization and stemming did not take place at all.

Here's the code for the file-based approach:

And here for the database approach:

So what am I doing wrong?

PS: DocContent is exactly what you see when you do a "cat filename.pdf".
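The symptom described above is consistent with the field still holding raw PDF bytes rather than extracted text. As a rough check, one could look at the first bytes of each blob: PDF files begin with the marker `%PDF-`. The following is a minimal sketch under that assumption; `sniff_doctype` is a hypothetical helper, and the HTML test is a crude heuristic, not a full detector.

```python
def sniff_doctype(blob):
    """Guess whether a blob holds raw PDF, HTML or plain text.

    Hypothetical helper: only inspects the start of the data.
    """
    # Look at the first kilobyte only and ignore leading whitespace.
    head = bytes(blob[:1024]).lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"
    # Anything else is treated as plain text in this sketch.
    return "text"
```

A blob reported as "pdf" would need a PDF text-extraction step before tokenization and stemming can produce meaningful terms.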

Legacy User · Answer

Hi everyone,

I am on Debian Linux with an Intel Core 2 Duo processor and 1 GB of RAM. I have downloaded and installed RapidMiner 4.3 Community: it works, it rocks!
Now I have downloaded the plugins and put them in the "RapidMiner/lib" directory. On startup, RM says: "unable to load <text plugin related jars> : bad version number". I used the 4.3 version of the plugins you provided.

Any ideas?

Rocky.
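A "bad version number" message of this kind typically means that a class file was compiled for a newer Java release than the one running it, which fits the Java 6 suggestion made elsewhere in the thread. The sketch below shows one way to read the version from a class file extracted from a jar; `class_file_major_version` is a hypothetical helper, and it relies only on the class file header layout (magic number `0xCAFEBABE`, then minor and major version). Major version 49 corresponds to Java 5 and 50 to Java 6.

```python
import struct


def class_file_major_version(data):
    """Return the major version stored in the header of a Java class file.

    Hypothetical helper: `data` is the raw bytes of a single .class file.
    """
    if len(data) < 8:
        raise ValueError("too short to be a Java class file")
    # Header layout: 4-byte magic, 2-byte minor, 2-byte major (big-endian).
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major
```

If the plugin classes report 50 while the installed runtime is Java 5, the runtime is too old for them.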