Mining a PDF document

Gjor New Altair Community Member
edited November 5 in Community Q&A
I'm new to RapidMiner. I would like to mine a PDF to create a word and number vector. I am using the following operators:
1.  Read Document (Content type: PDF and Encoding: system)
2.  Process Documents from Data (Prune method: absolute and data management: double_sparse_array)
    Inside Process Documents from Data:
    2.a  Extract Information (Query type: string matching)
    2.b  Tokenize (mode: non letters)
    2.c  Transform Cases (Transform to: lower case)
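
To make the intent of steps 2.b and 2.c concrete, the sketch below shows roughly what "tokenize on non-letters, then lower-case" does to a piece of text before the words are counted. It is plain Java, not RapidMiner code; the class name `WordVectorSketch` and the method `wordVector` are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class WordVectorSketch {
    // Splits the text on runs of non-letter characters, lower-cases each
    // token, and counts how often each token occurs.
    static Map<String, Integer> wordVector(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {mining=1, a=1, pdf=2, the=1, has=1, pages=1}
        System.out.println(wordVector("Mining a PDF: the PDF has 3 pages."));
    }
}
```

Note that in this sketch the digit "3" disappears, because splitting on non-letters treats digits as separators. If numbers should end up in the vector, a different tokenization rule would likely be needed.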

Error Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet


Stack trace:
------------

Exception: java.lang.ClassCastException
Message: com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet
Stack trace:
  com.rapidminer.operator.text.io.ExampleSetDocumentInputOperator.getTextObjects(ExampleSetDocumentInputOperator.java:110)
  com.rapidminer.operator.text.io.AbstractDocumentInputOperator.doWork(AbstractDocumentInputOperator.java:224)
  com.rapidminer.operator.Operator.execute(Operator.java:833)
  com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
  com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
  com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
  com.rapidminer.operator.Operator.execute(Operator.java:833)
  com.rapidminer.Process.run(Process.java:925)
  com.rapidminer.Process.run(Process.java:848)
  com.rapidminer.Process.run(Process.java:807)
  com.rapidminer.Process.run(Process.java:802)
  com.rapidminer.Process.run(Process.java:792)
  com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)





Hi Neil. I'm getting "com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet". The sequence is: 1. Read Document (PDF) ---> 2. Process Documents from Data (2.a Tokenize, 2.b Transform Cases). I'm trying to create a word vector. Thank you for your assistance.

Answers

  • Hello

    The output of the Read Document operator is a Document, whereas the Process Documents from Data operator expects an ExampleSet as its input. That mismatch is what causes the ClassCastException.

    One option is to insert a Documents to Data operator between them.

    Another, better option would be to use the Read Documents from Files operator.

    regards

    Andrew
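
The type mismatch described in the answer above can be reproduced in a few lines of plain Java. This is only a sketch: `IOObject`, `Document`, and `ExampleSet` here are hypothetical stand-ins defined inside the example, not the real RapidMiner classes, and `documentToData` merely imitates the idea of a conversion step between the two operators.

```java
import java.util.ArrayList;
import java.util.List;

public class CastSketch {
    // Hypothetical stand-ins for the two object types; NOT the real RapidMiner classes.
    interface IOObject {}

    static final class Document implements IOObject {
        final String text;
        Document(String text) { this.text = text; }
    }

    static final class ExampleSet implements IOObject {
        final List<String> rows = new ArrayList<>();
    }

    // Imitates an operator that assumes its input is an ExampleSet.
    static String consumeAsExampleSet(IOObject input) {
        try {
            ExampleSet set = (ExampleSet) input; // throws if input is a Document
            return "ok: " + set.rows.size() + " rows";
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    // Imitates a conversion step: wraps the document text as a single row.
    static ExampleSet documentToData(Document doc) {
        ExampleSet set = new ExampleSet();
        set.rows.add(doc.text);
        return set;
    }

    public static void main(String[] args) {
        Document doc = new Document("some pdf text");
        System.out.println(consumeAsExampleSet(doc));                 // prints ClassCastException
        System.out.println(consumeAsExampleSet(documentToData(doc))); // prints ok: 1 rows
    }
}
```

Passing the document directly fails at the cast, while passing it through the conversion step first succeeds, which mirrors why inserting a converting operator between the two resolves the error.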