Text extraction of key themes/words from a series of PDF files

pimlico35 New Altair Community Member
edited November 5 in Community Q&A
Hi Folks,

I'm new to this and struggling a little bit :)

I just wanted some easy (explicit) steps to help me achieve what I want to do, which is:

I have a series of reports, mostly PDFs:
- I want to extract key themes or words that recur throughout the reports, for example 'serious accident' or 'safety'.

What I have done so far is put all these files into a new repository. I have tried to use operators to read through the files, tokenise them, etc., but I'm getting lost in translation, so to speak ;)

- I'm not sure whether I have to convert the PDFs into Word files first, if that makes it easier to get them into RapidMiner, but that seems to defeat the whole purpose...

- I then want a document or table of these commonly occurring words so I can see how often they are used. Later I can also use the same output to check the least-used words.
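
To make the goal concrete, here is a rough sketch in plain Python (not RapidMiner) of the kind of frequency table I'm after. It assumes the text has already been pulled out of each PDF; the file names, sample sentences and stop-word list below are made up purely for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for text already extracted from each PDF report.
reports = {
    "report_a.pdf": "A serious accident occurred. Safety procedures were reviewed.",
    "report_b.pdf": "Safety training reduces the risk of a serious accident.",
}

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "the", "of", "were", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def count_terms(docs):
    """Count single words and adjacent word pairs across all documents.

    Pairs are formed after stop-word removal and ignore sentence boundaries,
    so some pairs will not be meaningful phrases.
    """
    counts = Counter()
    for text in docs.values():
        tokens = tokenize(text)
        counts.update(tokens)
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


counts = count_terms(reports)

# Most frequent terms first.
for term, n in counts.most_common(5):
    print(f"{term}\t{n}")

# The tail of the same ranking gives the least-used terms.
least_used = counts.most_common()[-5:]
```

The single table covers both needs: the top of the ranking shows the recurring themes, and the bottom shows the rarely used words.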

I would really appreciate any help or pointing me in the direction of videos that explicitly look at this.

Thanks so much!