"Reading Microsoft Word documents (word count)"

SergeMerz
SergeMerz New Altair Community Member
edited November 5 in Community Q&A
Hi,
  I searched for this topic and found almost nothing on reading DOC and DOCX documents with the 'Read Document' step. Is this possible without first converting the MS Word documents to a supported format (e.g. CSV, PDF, RTF, HTML)? I have thousands of Word documents, so I would like to read them without pre-processing.

Regards,
Serge

Answers

  • Marco_Boeck
    Marco_Boeck New Altair Community Member
    Hi,

    I'm afraid that is currently not possible.

    Regards,
    Marco
  • johan_CG
    johan_CG New Altair Community Member
    Hi

    I have the same problem.
    Currently I use a bash script to convert DOC and DOCX files, but I would like to avoid this pre-processing step.
    Please let me know if you find something that can help.

    Regards
    Johan
  • MariusHelf
    MariusHelf New Altair Community Member
    Unfortunately, RapidMiner cannot handle Word documents natively. You have to use a command-line tool to extract the text, e.g. antiword: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

    You can run the program from your RapidMiner process with the Execute Program operator.

    Best regards,
    Marius
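
The extraction step suggested above could look roughly like the Python sketch below. It is only an illustration under several assumptions, not something confirmed in this thread: antiword is installed and on the PATH, and since antiword targets the legacy DOC format, DOCX files are instead opened as ZIP archives and their word/document.xml is parsed with the standard library. The function names and the script name convert_word.py are hypothetical. It also assumes that the resulting plain-text files can then be read in RapidMiner, which is worth checking in your own setup. Headers, footnotes and text boxes are not handled.

```python
import subprocess
import sys
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml of a DOCX file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a DOCX file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of text within the paragraph.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def doc_to_text(path):
    """Return the text of a legacy DOC file by calling antiword.

    Assumption: the antiword executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def convert_folder(source_dir, target_dir):
    """Write one .txt per .doc/.docx file and return word counts by file name."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    counts = {}
    for path in sorted(Path(source_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".docx":
            text = docx_to_text(path)
        elif suffix == ".doc":
            text = doc_to_text(path)
        else:
            continue  # skip anything that is not a Word document
        (target / (path.stem + ".txt")).write_text(text, encoding="utf-8")
        # Whitespace-separated tokens as a rough word count.
        counts[path.name] = len(text.split())
    return counts


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python convert_word.py <source_dir> <target_dir>
    for name, count in convert_folder(sys.argv[1], sys.argv[2]).items():
        print(name, count)
```

A script like this could be started from the Execute Program operator before the documents are read, so the conversion happens inside the process rather than as a separate manual step. The word count here is a simple whitespace split, so it will differ slightly from the count Word itself reports.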