How do I get multiple pbs_tmrsh commands to run in parallel?

Rigoberto_20495

I have "pbs_tmrsh" commands executing in a loop. The problem I'm facing is that each "pbs_tmrsh" command has to finish before the next one is executed. I need all of them to run in parallel, because they are part of the same job and the job needs all of them to be running concurrently.

Here's my script that simply runs the hostname and date commands on the allocated nodes.

#PBS -l select="6:ngpus=1"
#PBS -q gpu2

# Show the allocated nodes in the ${PBS_NODEFILE}.
echo "START PBS_NODEFILE=${PBS_NODEFILE}"
cat "${PBS_NODEFILE}"
echo END PBS_NODEFILE

# Remove duplicate entries from the node file so that only one task
# runs on each node.

for host in $(sort -u "${PBS_NODEFILE}")
do
    pbs_tmrsh "$host" /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60'
done

You can see from the output that the second "pbs_tmrsh" command did not start until the 60-second sleep in the first "pbs_tmrsh" command had completed. The two timestamps are exactly 60 seconds apart.

[corujor@node003 ~]$ cat pbs_batch10.sh.o822
START PBS_NODEFILE=/var/spool/pbs/aux/822.node003
node002
node002
node002
node002
node010
node010
END PBS_NODEFILE

hostname=node002 : date=Tue Sep 13 14:59:06 CDT 2022
hostname=node010 : date=Tue Sep 13 15:00:06 CDT 2022

I tried adding an ampersand to the end of the line to run each "pbs_tmrsh" command in the background, so that the next "pbs_tmrsh" could start in parallel, as shown below.

pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60' &

However, the "pbs_mom" daemon on the compute node kills the job immediately when the ampersand is used.

09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;JOIN_JOB as node 1
09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;KILL_JOB received
09/13/2022 15:18:29;0008;pbs_mom;Job;823.node003;kill_job
09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;DELETE_JOB received
09/13/2022 15:18:30;0008;pbs_mom;Job;823.node003;kill_job

Is there a way to execute multiple pbs_tmrsh commands that are part of the same job concurrently?

Thank you.

Rigoberto

Tags: PBS Professional, v2021.1

Rigoberto_20495

I resolved this issue. Putting an "&" at the end of the "pbs_tmrsh" command runs it as a background task, and adding a "wait" after the for loop keeps the batch script from exiting until the background tasks complete. Together, these two changes solve the problem.
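For anyone landing here later, the fix described above might look roughly like the sketch below. It reuses the select statement, queue, and test command from the original script. The function name run_on_each_host, its optional second argument (a stand-in launcher for trying the script outside PBS), and the per-PID "wait" that collects exit statuses are illustrative additions, not part of the original post. A bare "wait" after the loop, as described above, is enough to keep the job alive.

```shell
#!/bin/bash
#PBS -l select="6:ngpus=1"
#PBS -q gpu2

# Launch one background task per unique host, then block until all finish.
# run_on_each_host is a hypothetical helper name; the second argument lets
# you substitute another launcher (assumption: defaults to pbs_tmrsh).
run_on_each_host() {
    local nodefile=$1
    local launcher=${2:-pbs_tmrsh}
    local host pid
    local pids=()
    local status=0

    for host in $(sort -u "$nodefile"); do
        # The trailing "&" puts each launcher call in the background,
        # so the loop moves on to the next host immediately.
        "$launcher" "$host" /bin/bash -c 'echo "hostname=$(hostname) : date=$(date)"; sleep 60' &
        pids+=("$!")
    done

    # Without a wait, the batch script would exit right after the loop
    # while the background tasks are still running.
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Only run automatically inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    run_on_each_host "$PBS_NODEFILE"
fi
```

With this in place, the date stamps printed by the different nodes should fall within about the same second instead of 60 seconds apart.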