[Solved] Web Crawler Operator: Empty folder and results

Question

Hi, I have followed all the instructions in http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html. My web crawler folder is empty. What am I doing wrong? The system times out at 42s. Has anyone had this problem after changing the rule to .+auburnbigdata.+?
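
For anyone checking the same rule, here is a minimal Python sketch of which URLs a pattern like .+auburnbigdata.+ accepts. It assumes the rule is matched against the entire URL, which may not be exactly how the crawler operator applies it. The function name follows_rule and the example.com URL are made up for illustration.

```python
import re

# Pattern quoted in the question. Assumption: it must match the whole URL.
RULE = re.compile(r".+auburnbigdata.+")


def follows_rule(url: str) -> bool:
    """Return True when the entire URL matches the crawl rule."""
    return RULE.fullmatch(url) is not None


urls = [
    "http://auburnbigdata.blogspot.com/2013/04/web-crawling-with-rapidminer.html",
    "http://www.example.com/",  # hypothetical URL outside the blog
    "auburnbigdata",            # nothing before or after the keyword
]

for url in urls:
    print(url, follows_rule(url))
```

Under that assumption, only URLs with at least one character on each side of "auburnbigdata" pass, so the blog URL matches and the other two do not.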

Kate_Strydom · Answer

:)  I am so excited to create data outside of the DWH. It gives new meaning to data mining.

We do not really know what happened, but it now works on our virtual machine setup, although there still seems to be a problem with RM on my PC.

An SA RM user suggested that we change the default max page size to 500.

Our server expert experimented with the settings, and we then changed the max threads to 4. Perhaps the crawler operator needs more threads, as my PC is limited to 2 threads.

We then tested it on a different website and I cannot wait to continue to learn the text processing part of RM.

I noticed that leaving the max pages blank means the crawler pulls everything. We first tested with max pages set to 20.
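
To picture the max pages behaviour, here is a rough Python sketch of a breadth-first crawl over a made-up in-memory link graph. It is not how RM implements the operator; the names LINKS, crawl and max_pages are assumptions for illustration only. Passing max_pages=None plays the role of leaving the field blank.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical link graph standing in for a real website.
LINKS: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}


def crawl(start: str, max_pages: Optional[int] = None) -> List[str]:
    """Visit pages breadth-first; max_pages=None means no limit."""
    visited: List[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        # Stop as soon as the page budget is used up.
        if max_pages is not None and len(visited) >= max_pages:
            break
        page = queue.popleft()
        visited.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("a", max_pages=2))  # stops after two pages
print(crawl("a"))               # no limit: visits every reachable page
```

With a small limit the crawl stops early, which is why testing with a low value first is a cheap way to confirm that the rules work before pulling a whole site.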