hi everyone

Mahmud_elabo New Altair Community Member
edited November 2024 in Community Q&A

I'm new to RapidMiner. I have about 200 PDF files that I want to text-mine, and I only need the keywords from those files.
Can anyone help?
Thanks in advance.

Best Answer

  • YYH
    Altair Employee
    Answer ✓
    Hi @Mahmud_elabo,

    The first step is to extract the text from the PDFs.
    You will need the "Process Documents from Files" operator from the Text Processing extension. The Academy has more demo videos on vectorization and keyword extraction (e.g. TF-IDF):
    https://academy.rapidminer.com/catalog?query=text mining

    You can set the location/path where the PDF files are stored. If the text in the PDFs is stored as images, you may need a third-party OCR (optical character recognition) tool.
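
    For anyone who wants to see what TF-IDF keyword scoring does outside of RapidMiner, here is a minimal Python sketch. It assumes the text has already been extracted from each PDF into a plain string. The function names, the sample file names, and the exact weighting (term frequency times log(N / document frequency)) are illustrative assumptions, not what the RapidMiner operator necessarily computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document.

    docs maps a (hypothetical) file name to its already-extracted text.
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    result = {}
    for name, words in tokens.items():
        counts = Counter(words)
        # Assumed weighting: relative term frequency times log(N / df).
        scores = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically to break ties.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[name] = [term for term, _ in ranked[:top_n]]
    return result


docs = {
    "a.pdf": "solar panels convert solar energy",
    "b.pdf": "wind turbines convert wind energy",
}
print(tfidf_keywords(docs, top_n=2))
```

    Terms that appear in every document (here "convert" and "energy") score zero, which is why TF-IDF tends to surface the words that distinguish one file from the others.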

    Hope it helps.

    YY

Answers

  • jacobcybulski
    jacobcybulski New Altair Community Member
    Could you explain what you mean by keywords?
  • Mahmud_elabo
    Mahmud_elabo New Altair Community Member
    Thank you for your help, @yyhuang,
    but I still have not been able to extract only the keywords and make a word frequency table.
  • Mahmud_elabo
    Mahmud_elabo New Altair Community Member
    @jacobcybulski I have PDF files that I need to import, and I want to compute word frequencies
    for just the keywords in those PDF files.
  • jacobcybulski
    jacobcybulski New Altair Community Member
    @Mahmud_elabo if you are able to extract the text from your PDF files, as suggested by @yyhuang, then set the vector creation parameter to "Term Occurrences", which counts the number of times each term was found in a given PDF. Once that is done, you may wish to aggregate (sum) those counts to get the number of times each term was found across all documents. If you want to find only specific (predefined) keywords in the documents, you will need to prepare a word list, which can then be passed to the "Process Documents from Files" operator. Is this what you are after?
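
    To make the counting described above concrete, here is a rough Python sketch of the same three steps: per-document term occurrences, a sum across all documents, and an optional predefined keyword list. It assumes the PDF text has already been extracted into plain strings. The function names and sample data are hypothetical, and this only illustrates the idea; it is not how the RapidMiner operator is implemented.

```python
import re
from collections import Counter


def term_occurrences(text, keywords=None):
    """Count how often each term occurs in one document's text.

    If keywords is given, only those terms are kept.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if keywords is not None:
        wanted = {k.lower() for k in keywords}
        counts = Counter({t: c for t, c in counts.items() if t in wanted})
    return counts


def total_occurrences(docs, keywords=None):
    """Return per-document counts and their sum across all documents."""
    per_doc = {
        name: term_occurrences(text, keywords) for name, text in docs.items()
    }
    total = Counter()
    for counts in per_doc.values():
        total.update(counts)  # adds the counts term by term
    return per_doc, total


docs = {
    "a.pdf": "Solar panels convert solar energy.",
    "b.pdf": "Wind turbines convert wind energy.",
}

# All terms, then only a predefined keyword list.
per_doc, total = total_occurrences(docs)
_, keyword_total = total_occurrences(docs, keywords=["solar", "wind"])
print(total.most_common())
print(keyword_total.most_common())
```

    The `total` counter is the word frequency table across all files; passing `keywords` plays the role of the word list and restricts the table to just those terms.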