"MySQL, PDFs and text mining"

New Altair Community Member
Updated by Jocelyn
Yep,
I know it's a bit of a weird title, but here is where it comes from. I have a rather large repository of documents (mostly PDFs, some HTML and some TXT files), and they are stored in a database (yes, a single field of every record of my table contains a whole PDF, alongside some fields describing the doctype, language, origin and a label). The files are kept in a database rather than on a filesystem because we want to enable outside access to them through some simple PHP files.
In the end, this database will also contain some properties of these PDFs gathered with RM...
However, we also want to explore all the PDFs and see which are 'familiar' and which are not (yes, using RM and the WVTool).
The problem now is this:
1/ Accessing a MySQL database from within RM is a piece of cake (thank you, RM developers!), and so is saying which field to pick.
2/ Using the WVTool on a directory of PDFs is easy as well.
But is there a way to forward the MySQL stream (which actually contains the PDFs) to the TextInput method of the WVTool, so that the WVTool treats these fields of the different records as files? Or do I need a method other than TextInput for this task?
So, in summary: how can I replace the directory or URL input option with some kind of ExampleSet input option?
BTW, I'm working on a Linux FC8 system, in case that helps to solve or circumvent this problem.
All help is greatly appreciated.
(A possible solution would be to start from PDFs on a filesystem and load them, together with the gathered data, into the MySQL database using RM. I want to avoid this, because we are working in a project with different teams: some provide the data (PDFs, TXTs, ..), I'm doing the text mining, and still others will use the outcome. Hence a web-accessible database is preferable to a filesystem...)
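One workaround I can imagine, if no ExampleSet-style input exists, is to dump the stored documents to a temporary directory and point the directory-based text input at that directory. The sketch below is only an illustration under assumptions, not something RM or the WVTool provides: the table name `documents` and the columns `id`, `doctype` and `content` are hypothetical, as are the helper name `export_documents` and the extension mapping. It accepts any Python DB-API connection; the demo uses an in-memory SQLite database as a stand-in for MySQL.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical mapping from the doctype field to a file extension.
EXTENSIONS = {"pdf": ".pdf", "html": ".html", "txt": ".txt"}


def export_documents(conn, target_dir, table="documents"):
    """Write every stored document to target_dir and return the written paths.

    Assumes a hypothetical table with the columns id, doctype and content,
    where content holds the raw file (bytes, or text for plain-text files).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    cur = conn.cursor()
    # Table names cannot be bound as query parameters, so `table` must be trusted.
    cur.execute(f"SELECT id, doctype, content FROM {table} ORDER BY id")
    written = []
    # fetchone() returns None once the result set is exhausted (DB-API 2.0).
    for doc_id, doctype, content in iter(cur.fetchone, None):
        ext = EXTENSIONS.get(str(doctype).lower(), ".bin")
        data = content.encode("utf-8") if isinstance(content, str) else bytes(content)
        path = target / f"doc_{doc_id}{ext}"
        path.write_bytes(data)
        written.append(path)
    return written


if __name__ == "__main__":
    # Stand-in database; with MySQL, pass a connection from a MySQL driver instead.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, doctype TEXT, content BLOB)"
    )
    conn.execute("INSERT INTO documents VALUES (1, 'pdf', ?)", (b"%PDF-1.4 fake",))
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_documents(conn, tmp):
            print(path.name)
```

The directory written this way could then be used wherever a directory of PDFs already works. The obvious cost is a second copy of every file on disk, so this only makes sense as a temporary step before the mining run.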
Best,
Patrick