"Reading Microsoft Word documents (word count)"
SergeMerz
New Altair Community Member
Hi,
I did some searching on this topic and found almost nothing on reading DOC and DOCX documents with the 'Read Document' operator. Is this possible without first converting the MS Word documents to a supported format (e.g. CSV, PDF, RTF, HTML)? I have thousands of Word documents, so I would like to read them without pre-processing.
Regards,
Serge
Answers
Hi,
I'm afraid that is currently not possible.
Regards,
Marco
Hi
I have the same problem.
Currently I use a bash script to convert DOC and DOCX files, but I would like to avoid this pre-processing step.
Please let me know if you find something that can help.
Regards
Johan
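For anyone who wants to script the conversion step mentioned above, here is a minimal sketch in Python that pulls plain text out of DOCX files using only the standard library. It is not a RapidMiner feature. It relies on the fact that a DOCX file is a ZIP archive whose main body is stored in `word/document.xml`. The function names `docx_to_text` and `word_count` are made up for this example, and the sketch ignores headers, footers, footnotes and legacy DOC files.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a DOCX file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # w:t elements hold the text runs that make up a paragraph
        runs = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def word_count(text):
    """Count whitespace-separated tokens (a rough word count)."""
    return len(text.split())
```

The resulting text could be written to TXT files and read from RapidMiner like any other plain-text input; whether that is good enough depends on how much of the document layout you need.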
Unfortunately, RapidMiner cannot read Word documents natively. You have to use a command-line tool to extract the text, e.g. antiword: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/
You can run the program from your RapidMiner process with the Execute Program operator.
Best regards,
Marius
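If you would rather convert a whole folder up front than call the tool per file, the antiword suggestion can be sketched as a small Python script. This is only an illustration under some assumptions: antiword is installed and on the PATH, it is called as `antiword <file>` and prints the extracted text to standard output, and that output can be decoded with the system's default encoding. The function names and the folder layout are hypothetical. antiword handles legacy DOC files only, so DOCX files are skipped here.

```python
import shutil
import subprocess
from pathlib import Path


def doc_to_text(doc_path, tool="antiword"):
    """Run the extraction tool on one DOC file and return its stdout."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH")
    result = subprocess.run(
        [tool, str(doc_path)],
        capture_output=True,  # collect the text the tool prints
        text=True,            # decode with the default encoding (assumption)
        check=True,           # raise if the tool reports an error
    )
    return result.stdout


def convert_folder(src_dir, out_dir, tool="antiword"):
    """Write one .txt file per .doc file in src_dir; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        target = out / (doc.stem + ".txt")
        target.write_text(doc_to_text(doc, tool), encoding="utf-8")
        written.append(target)
    return written
```

The same command line (`antiword` plus the file name) is what you would hand to the Execute Program operator if you prefer to stay inside the RapidMiner process.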