"Analytics with RapidMiner Rosette [getting started]"

ty
ty New Altair Community Member
edited November 5 in Community Q&A

Hi,

I'm just getting started with RM for text analytics. Everything has gone well working with structured data, but I'm struggling with analysing text documents. Could anyone provide a process showing how to extract entities from a PDF or Word doc?

 

I've searched these forums and Google and the only solution that seems to work is converting the file into a txt file first, which isn't ideal.

 

Any help would be super appreciated.

Best Answer

  • Telcontar120
    Telcontar120 New Altair Community Member
    Answer ✓

    I am afraid that converting the files first is the easiest option available to you with existing operators.  Another option would be to import your document text into a database first using a database program like MySQL and then use "Read Database."  But RapidMiner won't read Word Docs or PDF text directly.  
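The database route mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without a server, and the `documents` table, its columns, and the `load_texts` helper are hypothetical names, not anything RapidMiner requires. It also assumes the documents have already been converted to `.txt` files. Connecting "Read Database" to the resulting table is not shown.

```python
import sqlite3
from pathlib import Path


def load_texts(db_path, folder):
    """Store every .txt file in `folder` as one row (filename, body).

    Hypothetical helper: SQLite stands in for MySQL here.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits the inserts on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "filename TEXT PRIMARY KEY, body TEXT)"
        )
        for path in sorted(Path(folder).glob("*.txt")):
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?)",
                (path.name, path.read_text(encoding="utf-8")),
            )
    return conn
```

Each document ends up as a single row, so a query such as `SELECT filename, body FROM documents` would return one text attribute per file.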

Answers

  • sgenzer
    sgenzer
    Altair Employee

    Hi... you can extract the text from PDFs with the Read Document operator. It loses all formatting, but it is very easy to do. If you want to pull structured data from a table inside a PDF, the new "PDF Table Extraction" extension is rather good.

     

    As for docx, what I do is send the file to a conversion API (like Zamzar or Convertio), convert it to plain text, and then import the result.

     

    Scott
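As an alternative to an online converter, the docx-to-text step can also be done locally. The following is a minimal sketch, not the method described in the post above: it relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`, and it uses only the Python standard library. The function name `docx_to_text` is hypothetical. Headers, footers, footnotes, and all formatting are ignored, and text inside nested structures such as text boxes may be repeated.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> runs; join them back up.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

The returned string can be written to a `.txt` file (for example with `Path("report.txt").write_text(docx_to_text("report.docx"), encoding="utf-8")`, where both file names are placeholders) and then imported like any other text file.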

     

  • ty
    ty New Altair Community Member

    Hi Scott,

    I'm guessing it depends on what's in the specific file. All I get from the Read Document operator is nonsensical text in a font I can't make out.

     

     

  • ty
    ty New Altair Community Member

    Thanks a bunch for the input, guys.
