What does vovslave SICK status mean & how do I correct it? (NetworkComputer)

AlanB_22262 New Altair Community Member
edited February 2023 in Altair HPCWorks

Usually, the status of an NC vovslave is FULL, WRKNG, READY, or occasionally BUSY.

These statuses are normal: they indicate that the vovslave is either running one or more jobs or is available to run jobs.

The status is attached to an in-memory object in the NC vovserver that represents the vovslave, and is updated by messages from the vovslave.

The SICK status indicates an unusual situation. The vovslave exchanges a periodic heartbeat message with the NC vovserver. If the vovserver has not received the heartbeat for three consecutive vovslave update cycles, it assigns the SICK status to the vovslave.
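The rule described above can be illustrated with a small Python sketch. It is not the vovserver's actual implementation. The SlaveRecord class, its fields, and the 60-second update interval in the usage note are hypothetical and used only for illustration. Only the "three consecutive update cycles" threshold comes from the text.

```python
from dataclasses import dataclass

# From the text: three consecutive update cycles without a heartbeat.
MISSED_CYCLES_BEFORE_SICK = 3


@dataclass
class SlaveRecord:
    """Hypothetical in-memory record for one vovslave (illustration only)."""
    name: str
    update_interval: float  # seconds per update cycle (assumed parameter)
    last_heartbeat: float   # time the last heartbeat was received, in seconds
    status: str = "READY"


def missed_cycles(rec: SlaveRecord, now: float) -> int:
    """Number of whole update cycles elapsed since the last heartbeat."""
    return int((now - rec.last_heartbeat) // rec.update_interval)


def refresh_status(rec: SlaveRecord, now: float) -> str:
    """Mark the record SICK once enough cycles have passed without a heartbeat."""
    if missed_cycles(rec, now) >= MISSED_CYCLES_BEFORE_SICK:
        rec.status = "SICK"
    return rec.status


def on_heartbeat(rec: SlaveRecord, now: float, reported_status: str) -> None:
    """A heartbeat arrived: record its time and the status it reports."""
    rec.last_heartbeat = now
    rec.status = reported_status
```

With an assumed 60-second update interval, a record whose last heartbeat arrived at time 0 is still READY at 120 seconds (two missed cycles) and becomes SICK at 180 seconds (three missed cycles). A later heartbeat resets it, which mirrors the self-correcting behavior after a network problem is resolved.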

The heartbeat can go missing for a variety of reasons. One is a network connectivity problem; in that case, the SICK status may clear on its own once connectivity is restored.

More commonly, it arises because the top-level vovslaveroot process on the grid host has quit unexpectedly or was killed. If this happens, the jobs on that vovslave become lost to NC. They may continue to run to completion, but you can only determine this by examining the log files of the job.

To correct a SICK vovslave, check whether a vovslaveroot process is running as root on the vovslave host machine. If not, use the vovslavemgr command to stop the vovslave and restart it.
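The process check can be scripted. The following is a minimal Python sketch that assumes a Unix-like grid host where the standard `ps -e -o user,pid,comm` listing is available. The function names are hypothetical. Treating the owner as the root user is one reading of the instruction above, so pass a different `owner` if your installation runs vovslaveroot under another account.

```python
import subprocess


def list_processes() -> str:
    """Return the output of `ps` listing user, PID, and command name for all processes."""
    result = subprocess.run(
        ["ps", "-e", "-o", "user,pid,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_processes(ps_output: str, command_name: str = "vovslaveroot"):
    """Return (user, pid) pairs for every process whose command name matches."""
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that does not look like "user pid command".
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        if fields[2].strip() == command_name:
            matches.append((fields[0], int(fields[1])))
    return matches


def vovslaveroot_running(ps_output: str, owner: str = "root") -> bool:
    """True if a vovslaveroot process owned by `owner` appears in the listing."""
    return any(user == owner for user, _pid in find_processes(ps_output))
```

On the vovslave host, `vovslaveroot_running(list_processes())` returns False when no matching process is found. That is the case in which the vovslave should be stopped and restarted with vovslavemgr.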