"pdf tokenization (?)"

margkw
margkw New Altair Community Member
edited November 5 in Community Q&A
Hello guys,
I am totally new here and to RapidMiner!
I have an assignment to finish, so I don't have much time to explore RapidMiner. I will post my question here and hope to find an answer. It might be trivial; I apologise for that.

I have several PDF files. I want to tokenize them, i.e. to see every word that occurs and how many times each word appears.
For example, let's assume that a PDF contains the word "process". I want to see how many times this word appears, and I want to do the same for all the words in the PDF file. Is tokenization what I need to do? If yes, how do I do it? If not, what do you propose?
Thank you in advance!

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Yes, it is. Just load the data with Read Documents from Files, connect it to Process Documents, inside Process Documents add the Tokenize operator, and finally connect the output ports of the Process Documents operator to the process output.

    To get the aforementioned operators, you have to install the Text Processing extension.

    Best, Marius
  • margkw
    margkw New Altair Community Member
    Thank you very much. I will try that out and get back to you if I have any problems. Many, many thanks! :)
  • margkw
    margkw New Altair Community Member
    It's me again! How can I insert the Tokenize operator inside Process Documents?

    And what should the process output be?

    Sorry for the stupid questions. I am completely new to this.
  • MariusHelf
    MariusHelf New Altair Community Member
    Hi,

    These are very important concepts that are rather easy to understand but hard to explain here in text form. I would like to point you to the video tutorials on our website; there is one complete section about text processing.

    You'll find the link to the tutorials in the post linked in my signature.

    Happy Mining!
      -Marius
  • margkw
    margkw New Altair Community Member
    Thanks! I will be back with more questions! :D
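
For readers who want to see what the word counting asked about in the original question amounts to, here is a minimal sketch outside RapidMiner. It is not the Tokenize operator itself, only an illustration of the same idea in plain Python. It assumes the text has already been extracted from the PDF, splits on any non-letter character, and lowercases tokens so that "Process" and "process" count as the same word; the function names and the sample sentence are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens, breaking on any non-letter character."""
    # Lowercasing is an assumption made here so that case variants are merged.
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if tok]


def count_words(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


# Hypothetical sample standing in for text already extracted from a PDF.
sample = "The process starts. Each process runs; the Process ends."
counts = count_words(sample)

print(counts["process"])      # occurrences of the word "process"
print(counts.most_common(3))  # the three most frequent tokens
```

Getting the text out of the PDF in the first place needs a separate library or tool, which is left out here on purpose. Splitting only on ASCII letters also drops digits and accented characters, so the pattern would need adjusting for other languages.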