"Web crawling -overcome memory limit, split urls into susamples and then combine"

User: "In777"
Hello,

I retrieve data from several web pages (>30,000) with the "Get Pages" operator. I have imported all my URLs into the repository from an Excel file. I then process the information with regex (I extract several categories) and write the category information to Excel, with a separate row for each URL. My process works fine with a small number of URLs, but my computer does not have enough memory to process all the web pages at once.

I would like to split the URLs into pieces of about 2000 each, run the process on each piece separately, and join the Excel files together at the end. I looked at the sampling operators, but most of them produce a random sample, and I want to keep the order in which the URLs are crawled (if possible). I think I need to write a loop, but I cannot figure out where to start: I do not know which loop operator to use or how to make it write several Excel files or sheets with different names (1 to x). Could anybody help me with that?
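Outside of RapidMiner, the batching idea described above might look like the following Python sketch. It assumes the URLs are already in an ordered list, splits them into fixed-size chunks (2000 here, as suggested in the question), writes one numbered Excel file per chunk, and concatenates the files back together in their original order at the end. The `fetch_and_extract` helper, the `categories_N.xlsx` file names, and the title regex are placeholders standing in for the actual Get Pages and regex-extraction step, not part of the original process.

```python
import math
import re

import pandas as pd   # assumed available for Excel reading/writing
import requests       # assumed available for fetching pages

CHUNK_SIZE = 2000  # batch size suggested in the question

def fetch_and_extract(url):
    """Hypothetical stand-in for the Get Pages + regex step:
    fetch one page and pull out a 'title' category as an example."""
    html = requests.get(url, timeout=10).text
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"url": url, "title": match.group(1).strip() if match else ""}

def process_in_batches(urls, chunk_size=CHUNK_SIZE):
    """Split the ordered URL list into fixed-size batches and write one
    Excel file per batch (categories_1.xlsx .. categories_x.xlsx),
    preserving the original crawl order."""
    n_batches = math.ceil(len(urls) / chunk_size)
    for i in range(n_batches):
        batch = urls[i * chunk_size:(i + 1) * chunk_size]
        rows = [fetch_and_extract(u) for u in batch]
        pd.DataFrame(rows).to_excel(f"categories_{i + 1}.xlsx", index=False)
    return n_batches

def combine_batches(n_batches):
    """Join the per-batch Excel files back into one table, in order."""
    frames = [pd.read_excel(f"categories_{i}.xlsx")
              for i in range(1, n_batches + 1)]
    return pd.concat(frames, ignore_index=True)
```

Because the chunks are taken by slicing the list in order and the output files are numbered 1 to x, the combined table keeps the same order in which the URLs were crawled.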
