How to stop jobs using excess RAM (NetworkComputer)


Sometimes jobs use more RAM than expected. This can happen, for example, when NC is used for hardware-software co-development and code gets into an infinite loop.

NetworkComputer includes a system health check that can identify and stop such jobs. This prevents a runaway job from taking over a machine, crashing it, and taking down the other jobs running on that machine.

To implement this health check, perform the following steps:

Note: You must be the NC owner and have admin permissions.

1. Go to the admin web page and follow the link to the daemons page.

2. Click the "config" button for the vovnotifyd daemon. This opens a web page showing all the available health checks.

3. Click the health check called KillBullyJobs.

4. Click the edit button on that line.

5. Add the following options in the options field, adjusting the values for the total RAM and number of slots on your machines. The values shown here are only a starting estimate:

   -margin 1000 -minram 1000 -minfree 2000
 


With these values, a job is flagged only if it is using more than 1 GB of RAM above its RAM request and the slave has less than 2 GB of RAM free. Any job using less than 1 GB of RAM is ignored.

Any job that is flagged will be killed, and the user will receive an email noting this.

Typically, a runaway job consumes all the RAM on a slave very quickly, so the free RAM on that slave easily drops below the 2 GB threshold.

6. Save the options and enable the health check. The health check will start killing jobs as soon as it detects any that meet the specified conditions.
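
To help when choosing values, the three thresholds from step 5 can be read as a single test. The following is only an illustrative sketch of the conditions as described above, not the actual vovnotifyd implementation. The function name, the parameter names, and the assumption that all values are in MB are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Values mirroring -margin, -minram and -minfree (assumed to be in MB)."""
    margin_mb: int = 1000
    minram_mb: int = 1000
    minfree_mb: int = 2000


def would_be_flagged(used_mb, requested_mb, host_free_mb, t=Thresholds()):
    """Return True if a job meets all the conditions described in step 5.

    Hypothetical helper for reasoning about threshold values only.
    """
    # Jobs using less than minram are ignored entirely.
    if used_mb < t.minram_mb:
        return False
    # The machine must be short of free RAM.
    if host_free_mb >= t.minfree_mb:
        return False
    # The job must exceed its RAM request by more than the margin.
    return used_mb > requested_mb + t.margin_mb


# A job that requested 2000 MB, uses 6000 MB, on a host with 500 MB free.
print(would_be_flagged(used_mb=6000, requested_mb=2000, host_free_mb=500))
```

Trying a few realistic job sizes against candidate values in this way can show whether a margin is tight enough to catch runaway jobs without flagging jobs that only slightly exceed their request.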