How do I resolve error: "Job NNNN could not start after 1m00s, killing subslave MMMM"? (NetworkComputer)


If you see the following message, where NNNNN is the job number and MMMM is the pid of the subslave, vovslaveroot did not receive a pid from the subslave within the designated time.

Job NNNNN could not start after > 1m00s. Killing subslave with pid=MMMM

Each NetworkComputer machine in the computing farm has a top-level vovslaveroot process that runs as root. This process maintains a persistent TCP/IP connection to vovserver, from which it receives jobs and to which it returns information about those jobs.

When vovslaveroot receives a job, it uses its root privileges to switch to the account of the job submitter so that the job runs with the correct permissions. It then waits for the subslave to return a pid. The default wait is 1m00s, and the message above is issued when that interval expires without a pid being received.

To extend the wait time, set the environment variable VOV_MAX_WAIT_NO_START.

For example, to check the value currently set in the server environment:

% nc cmd vovshow -env MAX

Set the value in the 'setup.tcl' configuration file, or via vovsh with the vtk_server_setenv API call.

You can also set the value on running vovslaves, without restarting them, with the following command:

% nc cmd vovsh -x 'vtk_slave_config slave-name maxwaitnostart wait-value'
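
As an illustration, the sketch below fills in the two placeholders of the command above and prints the result instead of running it. The slave name farm01 and the wait value 2m00s are hypothetical. The value format is an assumption based on the 1m00s shown in the message, so confirm the accepted format for your release before running the printed command against a live farm.

```shell
#!/bin/sh
# Hypothetical values: replace with a real vovslave name and the wait you need.
SLAVE_NAME="farm01"   # assumption: name of a running vovslave
WAIT_VALUE="2m00s"    # assumption: same time format as the 1m00s in the message

# Fill in the command template; straight single quotes keep the
# vtk_slave_config call together as one argument to vovsh -x.
CMD="nc cmd vovsh -x 'vtk_slave_config $SLAVE_NAME maxwaitnostart $WAIT_VALUE'"

# Print only (dry run), so the command can be reviewed before it is executed.
echo "$CMD"
```

Once the printed command looks right, it can be pasted at the prompt. Repeat it with a different slave name for each vovslave that needs the longer wait.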