What to Do When a Job Fails to Start (NetworkComputer)

AlanB_22262
edited February 2023 in Altair HPCWorks

On rare occasions a job may fail to start. When that happens, the following information can help you find the source of the issue.

You may receive a message similar to the following:

Job killed because it failed to start within 1m00s.

This message usually points to a bad NFS mount point, although other causes are possible.

If you examine the vovserver's log file, you may also see a message like this:

vovserver(17638) ERROR Feb 21 21:18:03 Corrupt id 000000000 in VS_dispatch::VOV_RefuseJob msg=Bad subslave id for slave 'grid050602'
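
To check whether the log contains entries of this kind, a search along the following lines may help. This is only a sketch: the function name find_refused_jobs is made up for illustration, and the log path is a placeholder, since the location of the vovserver log depends on your installation.

```shell
#!/bin/sh
# Hypothetical helper: print vovserver log lines that mention a refused job
# or a bad subslave id. The caller supplies the path to the log file.
find_refused_jobs() {
    logfile="$1"
    grep -E 'VOV_RefuseJob|Bad subslave id' "$logfile"
}

# Example (placeholder path, adjust for your site):
# find_refused_jobs /path/to/vovserver.log
```

The vovslave name at the end of each matching line tells you which host to look at first.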

The main vovslaveroot process running as root fork-execs itself when it receives a job from vovserver. It uses its privileges to switch the child vovslaveroot to run as the job submitter, so that it will have the correct permissions with respect to the submitter's files.

That message means that the root vovslaveroot did not obtain a pid for the child within the default interval of 60 seconds. Possible causes include NFS problems or resource exhaustion on the host (out of memory, out of processes, etc.). If no pid is obtained in time, the job fails and is never launched.
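
If NFS is suspected, one quick check is to see whether a directory the job needs answers within a time limit, run on the vovslave host as the job submitter. The sketch below assumes the GNU coreutils timeout command is available; the function name check_dir_responds and the 10-second default are illustrative choices, not part of NetworkComputer. A mount that is hard-hung can block even this check.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory responds within a time limit.
# Assumes the GNU coreutils `timeout` command is installed.
check_dir_responds() {
    dir="$1"
    limit="${2:-10}"   # seconds to wait before giving up (illustrative default)
    if timeout "$limit" ls -d "$dir" >/dev/null 2>&1; then
        echo "OK: $dir"
    else
        echo "SLOW OR UNAVAILABLE: $dir"
        return 1
    fi
}

# Example (placeholder path, adjust for your site):
# check_dir_responds /path/to/job/working/directory 10
```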

The environment variable VOV_MAX_WAIT_NO_START, which may be set via vnc.swd/setup.tcl, changes how long vovslave waits for the child process. A change made this way requires a vovslave restart.

Note: You can also change the interval without a vovslave restart. Get a shell as the NC owner with the Runtime commands in the PATH, then run the following command:

 % nc cmd vovsh -x 'vtk_slave_config vovslave-name maxwaitnostart time-spec'

For example, the following command changes the interval to 4 minutes for a vovslave named grid050602:

 % nc cmd vovsh -x 'vtk_slave_config grid050602 maxwaitnostart 4m'
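
If several vovslaves need the same setting, the command can be generated once per name. The sketch below only prints the commands so they can be reviewed before being run. The function name print_maxwait_cmds is hypothetical, and grid050603 in the example is a made-up second vovslave name.

```shell
#!/bin/sh
# Hypothetical helper: print one vtk_slave_config command per vovslave.
# Nothing is executed here; review the output, then run it as the NC owner.
print_maxwait_cmds() {
    timespec="$1"   # e.g. 4m
    shift
    for slave in "$@"; do
        printf "nc cmd vovsh -x 'vtk_slave_config %s maxwaitnostart %s'\n" "$slave" "$timespec"
    done
}

# Example with placeholder names:
# print_maxwait_cmds 4m grid050602 grid050603
```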