"Scrape a website and download hyperlinked pdf files"

gary_molloy
gary_molloy New Altair Community Member
edited November 5 in Community Q&A

I can scrape in Python, but how do I download and store hyperlinked PDF or other files in their native format using RapidMiner?

Answers

  • Telcontar120
    Telcontar120 New Altair Community Member

    Is the "Open File" operator not doing what you want? It fetches a file from any URL or file path and returns it as a file object, which can then be stored. If you have multiple files, you can use macros and put this in a loop.

    If you want to scrape actual web pages, then use "Get Page" or "Get Pages" instead.
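
For comparison with the Python side of the question, the same idea (collect the hyperlinked files, then fetch each one in a loop and store the bytes unchanged) could be sketched roughly as below. This is only an illustration using the Python standard library, not a RapidMiner process; the names `LinkCollector`, `extract_file_links`, and `download`, the `downloads` folder, and the example.com URL are all assumptions made up for the sketch.

```python
import shutil
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_file_links(html, base_url, suffixes=(".pdf",)):
    """Return absolute URLs of links whose path ends in one of `suffixes`.

    `suffixes` must be a tuple of lowercase extensions (hypothetical default).
    """
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(suffixes):
            links.append(absolute)
    return links


def download(url, dest_dir):
    """Stream the raw bytes at `url` to disk so the file keeps its native format."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / Path(urlparse(url).path).name
    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    page_url = "https://example.com/reports/"  # hypothetical page to scan
    with urlopen(page_url) as response:
        page_html = response.read().decode("utf-8", errors="replace")
    # Loop over every matching link, one download per file.
    for link in extract_file_links(page_html, page_url):
        print(download(link, "downloads"))
```

Writing in binary mode (`"wb"`) is what keeps a PDF or any other file type intact. Passing something like `suffixes=(".pdf", ".docx")` would extend the sketch to other file types.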


  • sgenzer
    sgenzer
    Altair Employee

    Hello @gary_molloy - if you use the "Crawl Web" operator (Web Mining extension), there is an option to "write pages to disk". This will save the PDFs to disk as regular files. I have done this many times.


    Scott