I have "pbs_tmrsh" commands executing in a loop. The problem I'm facing is that each "pbs_tmrsh" command has to finish before the next one is executed. I need all of them to run in parallel, as they are part of the same job and the job needs all of them to be running concurrently.
Here's my script that simply runs the hostname and date commands on the allocated nodes.
    #PBS -l select="6:ngpus=1"
    #PBS -q gpu2

    # Show the allocated nodes in the ${PBS_NODEFILE}.
    echo "START PBS_NODEFILE=${PBS_NODEFILE}"
    cat $PBS_NODEFILE
    echo END PBS_NODEFILE

    # Remove the duplicate nodes to avoid running multiple tasks on each node,
    # such that only one task on each node.
    for host in $(sort -u "${PBS_NODEFILE}")
    do
        pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60'
    done
You can see from the output that the second "pbs_tmrsh" command ran only after the 60-second sleep from the first "pbs_tmrsh" command completed.
    [corujor@node003 ~]$ cat pbs_batch10.sh.o822
    START PBS_NODEFILE=/var/spool/pbs/aux/822.node003
    node002
    node002
    node002
    node002
    node010
    node010
    END PBS_NODEFILE
    hostname=node002 : date=Tue Sep 13 14:59:06 CDT 2022
    hostname=node010 : date=Tue Sep 13 15:00:06 CDT 2022
I tried adding an ampersand to the end of the line to run the "pbs_tmrsh" command in the background, so that the next "pbs_tmrsh" could run in parallel, as shown below.
pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60' &
However, the "pbs_mom" daemon on the compute node kills the job immediately when the ampersand is used.
    09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;JOIN_JOB as node 1
    09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;KILL_JOB received
    09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;kill_job
    09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;DELETE_JOB received
    09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;kill_job
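One possible explanation, which is an assumption and not something the log confirms, is that with the ampersand the job script reaches its end while the background commands are still running, so PBS treats the job as finished and cleans up the tasks on the other nodes. A pattern that may be worth testing is to start every "pbs_tmrsh" in the background and then call the shell's "wait" builtin so the script stays alive until they all exit. The sketch below is untested on a real cluster; "run_on_each_host" is a hypothetical helper name, not part of PBS.

```shell
#!/bin/bash
# Sketch only: start one launcher per unique host in the background, then
# block until all of them have exited so the job script itself stays alive.

# run_on_each_host is a hypothetical helper, not a PBS command.
#   $1 = launcher command (pbs_tmrsh inside a job)
#   $2 = node file (one host name per line, duplicates allowed)
#   $3 = command string handed to the remote "bash -c"
run_on_each_host() {
    local launcher=$1 nodefile=$2 remote_cmd=$3
    local host pid status=0
    local pids=()

    for host in $(sort -u "$nodefile"); do
        # "&" backgrounds each launcher so the loop does not block on it.
        "$launcher" "$host" /bin/bash -c "$remote_cmd" &
        pids+=("$!")
    done

    # Wait for every background task; remember whether any of them failed.
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Only run when inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    run_on_each_host pbs_tmrsh "$PBS_NODEFILE" \
        'echo "hostname=$(hostname) : date=$(date)"; sleep 60'
fi
```

Waiting on each PID separately, instead of a bare "wait", lets the script notice when one of the remote commands fails. Whether this actually prevents the KILL_JOB seen above would need to be confirmed on the cluster.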
Is there a way to execute multiple pbs_tmrsh commands that are part of the same job concurrently?
Thank you.
Rigoberto