How do I get multiple pbs_tmrsh commands to run in parallel?

Rigoberto_20495 Altair Community Member
edited September 2022 in Community Q&A

I have "pbs_tmrsh" commands executing in a loop.  The problem I'm facing is that each "pbs_tmrsh" command has to finish before the next one starts.  I need all of them to run in parallel, because they are part of the same job and the job needs all of them to be running concurrently.

Here's my script that simply runs the hostname and date commands on the allocated nodes.

#PBS -l select="6:ngpus=1"
#PBS -q gpu2

# Show the allocated nodes listed in ${PBS_NODEFILE}.
echo "START PBS_NODEFILE=${PBS_NODEFILE}"
cat "${PBS_NODEFILE}"
echo END PBS_NODEFILE

# Remove the duplicate node entries so that only one task runs on each node.

for host in $(sort -u "${PBS_NODEFILE}")
do
    pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60'
done


The output shows that the second "pbs_tmrsh" command started only after the 60-second sleep in the first "pbs_tmrsh" command had completed: the two timestamps are exactly one minute apart.

[corujor@node003 ~]$ cat pbs_batch10.sh.o822
START PBS_NODEFILE=/var/spool/pbs/aux/822.node003
node002
node002
node002
node002
node010
node010
END PBS_NODEFILE

hostname=node002 : date=Tue Sep 13 14:59:06 CDT 2022
hostname=node010 : date=Tue Sep 13 15:00:06 CDT 2022


I tried adding an ampersand to the end of the line to run each "pbs_tmrsh" command in the background, so that the next "pbs_tmrsh" could run in parallel, as shown below.


pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60' &

However, the "pbs_mom" daemon on the compute node kills the job immediately when the ampersand is used.

09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;JOIN_JOB as node 1
09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;KILL_JOB received
09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;kill_job
09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;DELETE_JOB received
09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;kill_job



Is there a way to execute multiple pbs_tmrsh commands that are part of the same job concurrently?

Thank you.

Rigoberto

Answers

  • Rigoberto_20495 Altair Community Member
    edited September 2022

    I resolved this issue.  Putting an "&" at the end of the "pbs_tmrsh" command to run it as a background task, and adding a "wait" after the for loop so that the batch script does not exit until the background tasks complete, solves the problem.
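
For anyone landing here later, the fix described above might look roughly like the sketch below. It is not the exact script from the job, only an illustration of the "&" plus "wait" pattern. Two things are assumptions that are not in the original post: the fallback block at the top (a stand-in "pbs_tmrsh" function and a temporary node file, added only so the script can be dry-run outside a PBS job), and the "pids", "count" and "failed" variables, which wait on each task separately so that failures can be counted. The sleep is shortened from 60 seconds to 1 to keep a dry run quick. Presumably the original "&"-only attempt was killed because the batch script reached its end right after the loop, which ended the job while the tasks were still running.

```shell
#PBS -l select="6:ngpus=1"
#PBS -q gpu2

# ASSUMPTION (not in the original post): fallbacks for a dry run outside PBS.
# Inside a real job, pbs_tmrsh and PBS_NODEFILE are already provided.
if ! command -v pbs_tmrsh >/dev/null 2>&1; then
    # Hypothetical stand-in: drop the host argument and run the command locally.
    pbs_tmrsh() { shift; "$@"; }
fi
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'localhost\nlocalhost\n' > "${PBS_NODEFILE}"
fi

pids=""
count=0
for host in $(sort -u "${PBS_NODEFILE}")
do
    # The trailing "&" runs this pbs_tmrsh in the background, so the loop
    # moves on to the next host without waiting for it to finish.
    pbs_tmrsh "$host" /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 1' &
    pids="$pids $!"
    count=$((count + 1))
done

# Block until every background task has finished. Without this, the batch
# script would exit immediately after the loop.
failed=0
for pid in $pids
do
    wait "$pid" || failed=$((failed + 1))
done
echo "tasks=${count} failed=${failed}"
```

A bare "wait" with no arguments after the loop is enough if the exit status of the individual tasks does not matter; waiting on each PID is just one way to notice when a task on some node fails.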