Fix Job going to H State

Manoj Soni_20447
Altair Employee
edited February 2023 in Altair HPCWorks

A job in the H (held) state is not scheduled to run, even though enough free resources are available for it. This is not always an error. Sometimes you may want to put your own job in the H state deliberately, using the "qhold" command. Such a "USER HOLD" can be released with the qrls command, as below:

qrls <job id>
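
Before releasing a hold, it helps to know which kind of hold the job carries. A minimal sketch, assuming the usual `name = value` layout of `qstat -f` output, where the Hold_Types attribute shows `u` for a user hold and `s` for a system hold. The helper name hold_types and the job ID in the usage comment are illustrative, not part of PBS:

```shell
# Print the Hold_Types value found in `qstat -f` output read from stdin.
# hold_types is a hypothetical helper name, not a PBS command.
hold_types() {
    awk -F' = ' '$1 ~ /^[ \t]*Hold_Types$/ { print $2; exit }'
}

# Usage (1234 is a placeholder job ID):
#   qstat -f 1234 | hold_types
```

If the helper prints nothing, the attribute was not present in the output it was given.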

Also, a job dependency keeps the dependent job in the Hold (H) state until the dependency is met. Once the dependency is met, the job automatically becomes eligible to run.
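
As an illustration of a dependency hold, the sketch below submits two job scripts so that the second waits for the first. It uses the `-W depend=afterok:<job id>` option of qsub. The function name submit_chain and the script names in the usage comment are hypothetical:

```shell
# Submit two job scripts; the second runs only after the first exits successfully.
# submit_chain is a hypothetical helper name.
submit_chain() {
    # qsub prints the new job ID on stdout.
    first_id=$(qsub "$1") || return 1
    # The dependent job is expected to sit in the H state until the first job
    # finishes with exit status 0.
    qsub -W depend=afterok:"$first_id" "$2"
}

# Usage (script names are placeholders):
#   submit_chain prepare.sh compute.sh
```

No qrls is needed for the second job in this case, since the hold clears on its own once the dependency is met.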

PBS Pro can also place a "SYSTEM HOLD" on a job, which likewise shows as state H, for any of the following reasons:

  • The user/job owner is not able to log in to the compute node. This may indicate that the NIS/LDAP service is down on that compute node.
  • A new user has been added to NIS/LDAP. In that case, you need to restart pbs_mom on the compute node.
  • Required file systems, such as the user's home directory, are not mounted on the compute node.
  • You are using cgroups and there are orphan cgroup reservations on the scheduled compute node. PBS is not aware of these orphan reservations left behind by previous jobs that ran on that node. So PBS Pro schedules the job on that compute node, but the cgroup layer treats it as oversubscription of resources and does not create the cgroup allocation for the job, and the job goes into the H state. The easiest way to flush the orphan cgroup reservations from a compute node is to restart the pbs_mom service on that node. Just make sure no job is running on that compute node when you restart the pbs_mom service, as the restart might kill running jobs.
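
Some of the causes above can be checked from the shell before restarting anything. The following is a rough sketch, not an official procedure. check_user_env is meant to be run on the compute node and only tests that the job owner resolves and that a required directory exists. node_is_idle reads `pbsnodes <node>` output and assumes running jobs appear on a `jobs = ...` line. Both function names, and the user, path and node names in the usage comments, are hypothetical:

```shell
# Run on the compute node: does the job owner resolve (NIS/LDAP), and does a
# required directory (for example the home directory) exist?
check_user_env() {
    user=$1
    dir=$2
    if ! id "$user" >/dev/null 2>&1; then
        echo "user $user does not resolve on $(uname -n)"
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "directory $dir is missing on $(uname -n)"
        return 1
    fi
    echo "user $user and directory $dir look fine"
}

# Reads `pbsnodes <node>` output on stdin and succeeds only if no jobs are
# listed, i.e. the node looks safe for a pbs_mom restart.
node_is_idle() {
    awk 'BEGIN { busy = 0 }
         /^[ \t]*jobs = / {
             sub(/^[ \t]*jobs = /, "")
             if ($0 != "") busy = 1
         }
         END { exit busy }'
}

# Usage (alice, /home/alice and node01 are placeholders):
#   check_user_env alice /home/alice
#   pbsnodes node01 | node_is_idle && echo "no jobs listed on node01"
```

An existing directory does not prove that the right file system is mounted, so treat a passing check as a hint rather than a guarantee.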

To remove a SYSTEM HOLD from a job, use the qrls command again, this time with the additional option -h, as below:

qrls -h s <job id>
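
A small sketch of releasing the hold and confirming the result in one step. It assumes the account running it has sufficient PBS privilege to release a system hold, and that `qstat -f` reports job_state and Hold_Types on `name = value` lines. The function name release_system_hold is hypothetical:

```shell
# Release a system hold, then show the job's state and remaining hold types.
# release_system_hold is a hypothetical helper name.
release_system_hold() {
    qrls -h s "$1" || return 1
    qstat -f "$1" | grep -E '^[[:space:]]*(job_state|Hold_Types) = '
}

# Usage (1234 is a placeholder job ID):
#   release_system_hold 1234
```

If the job returns to the H state shortly after being released, the underlying cause on the compute node is probably still present.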