Manage number of running jobs per user (NetworkComputer)

AlanB_22262 New Altair Community Member
edited February 2023 in Altair HPCWorks

The command line submitted to NetworkComputer is run under the control of a wrapper program, which is usually vw. The vw wrapper sets up the standard I/O streams for the submitted job and actually starts it by calling execve() with the command line. The wrapper connects back to the NC vovserver so that it can report the job status while the job runs and the result when it finishes.
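To make the wrapper pattern concrete, here is a minimal Python sketch of the general idea: fork, set up stdin, call execve(), wait, and report the result. It is not how vw itself is implemented. The function name run_wrapped and the report callback are hypothetical; report stands in for the update a real wrapper would send back to the vovserver.

```python
import os
import shutil


def run_wrapped(argv, env=None, report=print):
    """Run argv in a child process via execve() and report its exit status.

    `report` is a hypothetical stand-in for the status update that a real
    wrapper would send back to the scheduler.
    """
    env = dict(os.environ if env is None else env)

    # execve() does not search PATH, so resolve the executable first.
    path = shutil.which(argv[0], path=env.get("PATH"))
    if path is None:
        report({"status": "failed", "exit_code": 127})
        return 127

    pid = os.fork()
    if pid == 0:
        # Child: point stdin at /dev/null, then replace this process
        # with the job. The finally clause only runs if something failed.
        try:
            fd = os.open(os.devnull, os.O_RDONLY)
            os.dup2(fd, 0)
            os.close(fd)
            os.execve(path, argv, env)
        finally:
            os._exit(126)

    # Parent: wait for the job and translate the wait status.
    _, wait_status = os.waitpid(pid, 0)
    if os.WIFEXITED(wait_status):
        code = os.WEXITSTATUS(wait_status)
    else:
        code = -os.WTERMSIG(wait_status)  # killed by a signal

    report({"status": "done" if code == 0 else "failed", "exit_code": code})
    return code
```

In this sketch the parent process stays alive for the whole job, which mirrors why a connected wrapper occupies a client slot on the server for the duration of the job.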

The NC GUI and jobs using -w options also listen to the stream of microevents from the vovserver, which uses an additional file descriptor. The number of file descriptors available to the NC vovserver limits how many clients may connect concurrently. This limit is inherited from the shell in which ncmgr was used to start NC.

Because both normal and notify clients use file descriptors in the vovserver process, the server enforces limits that guard against accidental or intentional denial of service through exhaustion of vovserver's file descriptors.

These limits are controlled by the server configuration parameters maxNormalClients and maxNotifyClients, as described here: http://nc-host:nc-port/doc/VOV_RefManual/error_too_many_notify_clients.html (substitute your local values for host and port).

If you raise these from the defaults of 400 (normal) and 40 (notify), be sure that vovserver is running with enough file descriptors to avoid exhaustion.
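As a rough sanity check before raising the limits, the following Python sketch compares the file-descriptor soft limit of the current process with the planned client counts. Run it from the same shell that will start ncmgr, since that is where the limit is inherited from. The assumption of one descriptor per client and the reserved value of 64 (for the server's own files and sockets) are illustrative guesses, not documented vovserver figures.

```python
import resource


def descriptors_needed(max_normal=400, max_notify=40, reserved=64):
    """Estimate the descriptors needed, assuming one per client.

    `reserved` is a hypothetical allowance for the server's own
    files and sockets; adjust it for your site.
    """
    return max_normal + max_notify + reserved


def has_enough_descriptors(soft_limit, max_normal=400, max_notify=40, reserved=64):
    """Return True if soft_limit covers the estimated need."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= descriptors_needed(max_normal, max_notify, reserved)


if __name__ == "__main__":
    # Soft limit of this process, i.e. what a child started from
    # this shell would inherit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    need = descriptors_needed(max_normal=2000, max_notify=200)
    print(f"soft limit: {soft}, hard limit: {hard}, estimated need: {need}")
    if not has_enough_descriptors(soft, max_normal=2000, max_notify=200):
        print("Raise the soft limit (for example with ulimit -n) before starting NC.")
```

The values 2000 and 200 in the example are placeholders for whatever you plan to set maxNormalClients and maxNotifyClients to.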

There is a way, typically used in larger farms, to run jobs without a connection to the vovserver: use the vwn wrapper or vw -N. The trade-off is that you lose the ability to set VOV_LM_VARNAMES and to check timestamps and outputs for errors.

To make this the default, set VOV_JOB_DESC(wrapper) vwn in the optional config file $VOVDIR/local/vncrun.config.tcl. Alternatively, include '-wrapper vwn' in the environment variable NC_RUN_ARGS in the submit environment.

You can also use the vwn wrapper on a job-by-job basis by passing -wrapper vwn on the submit command line.
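For sites that submit jobs from scripts, the environment-variable and per-job options might be sketched in Python as below. The helper names are hypothetical, and the sketch assumes that NC_RUN_ARGS holds whitespace-separated options and that jobs are submitted with "nc run"; check both against your installation before relying on it.

```python
def with_wrapper_default(env, wrapper="vwn"):
    """Return a copy of env whose NC_RUN_ARGS also requests `wrapper`.

    Assumes NC_RUN_ARGS is a whitespace-separated option string.
    An existing -wrapper setting is left untouched.
    """
    new_env = dict(env)
    args = new_env.get("NC_RUN_ARGS", "").split()
    if "-wrapper" not in args:
        args += ["-wrapper", wrapper]
    new_env["NC_RUN_ARGS"] = " ".join(args)
    return new_env


def submit_command(job_argv, wrapper=None):
    """Build an argument list for a single submission.

    Assumes the submit command is `nc run`; pass wrapper="vwn"
    to select the wrapper for this job only.
    """
    cmd = ["nc", "run"]
    if wrapper:
        cmd += ["-wrapper", wrapper]
    return cmd + list(job_argv)
```

The returned list and environment can be handed to subprocess.run(cmd, env=new_env); building them in separate functions keeps the site-wide default and the per-job override easy to tell apart.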