Automatically allocating nodes by specifying the number of GPUs
Hello,
With Slurm, you can use "salloc --gpus=12" to allocate 12 GPUs, and Slurm will automatically allocate the number of nodes needed to satisfy the 12-GPU requirement. For example, if each GPU node has 4 GPUs, then "salloc --gpus=12" will automatically allocate 3 nodes, since (3 nodes x 4 GPUs each) = 12 GPUs total.
Is this possible with PBS? I tried "qsub -l ngpus=12 ...", but "qstat" reports:
Can Never Run: Insufficient amount of resource: sales_op (none != )
I don't want to have to specify "-l nodes=" or "-l select=", since I would like the number of nodes that PBS allocates to be based solely on the number of GPUs requested.
Any suggestions on how to do this with PBS would be greatly appreciated.
Thank you.
Best Answer
-
Rigoberto_20495 said:
Thank you, Joshua. I'm not sure I quite understand what you've suggested, though. The man page for "pbs_tmrsh" says:
The program is intended to be used during MPI integration activities, and not by end-users.
We don't want to use MPI to launch the job, because we don't know if the customer will have MPI installed or, if they do have it installed, where it is installed or what flavor of MPI they have. We would rather stick with "pbsdsh", if possible, which ships with PBS Pro.
Is there a way that I can continue to use "pbsdsh", but modify the $PBS_NODEFILE prior to its execution, such that duplicate nodes are removed and it only has one node per task?
Currently, when I run:
qsub -q rig_test_gpu -l "select=8:ngpus=1" -- /opt/pbs/bin/pbsdsh -- bash -c 'echo "$(hostname);CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
The $PBS_NODEFILE has this content:
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
If I can "sort $PBS_NODEFILE | uniq" prior to executing "pbsdsh", then it would contain:
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
and only two tasks, one on each node, would hopefully get executed.
Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
> The program is intended to be used during MPI integration activities, and not by end-users.
This is the primary use case, but it can indeed be called directly and function properly. There is an example in the PBS Pro Admin Guide 2022.2 section 8.5.8.3 "Example Job" where pbs_tmrsh is directly called within the job script.
> Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
Pre-execution scripts can be created using hooks such as execjob_begin or execjob_launch, though from my tests, modifying $PBS_NODEFILE has no effect on pbsdsh. It appears the documentation should be clarified about the $PBS_NODEFILE, as it is apparently not used as the source node list for pbsdsh. I will create a ticket for that.
My recommendation would be to use pbs_tmrsh, ssh or pdsh.
Thanks!
Joshua
Answers
-
Hello Rigoberto,
For selectable node-level resources such as ngpus, the select statement is still used. "qsub -l select=12:ngpus=1" should get you what you need, as it will assign 12 chunks, each containing 1 GPU (ngpus=1). These chunks may all fit on one node or be spread across multiple nodes with free placement.
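For illustration, a minimal sketch of that request; the second form adds an explicit placement directive ("scatter" asks for each chunk to land on a different host), which is optional and shown only as an assumption about what you might want:
qsub -l select=12:ngpus=1 -- /bin/hostname
qsub -l select=12:ngpus=1 -l place=scatter -- /bin/hostname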
Hope this helps!
Joshua
-
Thank you so much, Joshua, but I really need only 1 task per node for the job.
I have two 4-GPU nodes that I'm testing with. What I would like is to have all 8 GPUs allocated to the job and run 1 task per node.
When I tried something similar to what you suggested, I got 8 tasks:
[corujor@node003 ~]$ qsub -q rig_test_gpu -l "select=8:ngpus=1" -- /opt/pbs/bin/pbsdsh -- bash -c 'echo "$(hostname);CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
554.node003
The qstat output shows:
                                                                                            Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK  Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
554.node003                    corujor         rig_test_gpu    STDIN                 --    8     8     -- 01:00 R    --
    node010/0+node010/1+node010/2+node010/3+node002/0+node002/1+node002/2+node002/3
    Job was sent for execution at Tue Aug 30 at 12:00 on (node010[1]:ngpus=1:ncpus=1)+(node010[1]:ngpus=1:ncpus=1)+...
The output file does show that 2 nodes were allocated, but each node ran 4 tasks:
[corujor@node003 ~]$ cat STDIN.o554
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
I would like to get only 1 task per node that is allocated, where each task is assigned 4 GPUs in this example.
Thank you,
Rigoberto
-
Rigoberto_20495 said:
Thank you so much, Joshua, but I really need only 1 task per node for the job.
I have two 4-GPU nodes that I'm testing with. What I would like is to have all 8 GPUs allocated to the job and run 1 task per node.
When I tried something similar to what you suggested, I got 8 tasks:
[corujor@node003 ~]$ qsub -q rig_test_gpu -l "select=8:ngpus=1" -- /opt/pbs/bin/pbsdsh -- bash -c 'echo "$(hostname);CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
554.node003
The qstat output shows:
                                                                                            Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK  Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
554.node003                    corujor         rig_test_gpu    STDIN                 --    8     8     -- 01:00 R    --
    node010/0+node010/1+node010/2+node010/3+node002/0+node002/1+node002/2+node002/3
    Job was sent for execution at Tue Aug 30 at 12:00 on (node010[1]:ngpus=1:ncpus=1)+(node010[1]:ngpus=1:ncpus=1)+...
The output file does show that 2 nodes were allocated, but each node ran 4 tasks:
[corujor@node003 ~]$ cat STDIN.o554
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node010;CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
node002;CUDA_VISIBLE_DEVICES=GPU-ee310a06-c7fb-f921-f137-d605913909b1,GPU-b4b1244f-a384-ac0a-2933-ee2f2e10ae26,GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
I would like to get only 1 task per node that is allocated, where each task is assigned 4 GPUs in this example.
Thank you,
Rigoberto
Yes, this makes sense if you're using pbsdsh, as it starts a task for every line in $PBS_NODEFILE. You would need to use an alternative to be more selective about the number of tasks you run.
If you wanted to stick to the Task Manager, you could use pbs_tmrsh <node> <cmd> <args>, which uses the same syntax as the old rsh command, but runs over PBS rather than over rshd. You are restricted to nodes assigned to the job though.
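For instance, a quick sketch using one of the hostnames from your job (assuming node002 has been assigned to it):
pbs_tmrsh node002 hostname
This runs "hostname" on node002 through the PBS Task Manager rather than over rshd.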
Tasks are also commonly started using ssh, and a $PBS_NODEFILE run through sort -u will give you a single line for each host you've been assigned.
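A minimal sketch of the ssh route, assuming passwordless ssh is configured between the execution hosts:
for host in $(sort -u $PBS_NODEFILE); do ssh $host hostname; done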
Many MPIs can also use SSH as well as the PBS Task Manager API. Most, if not all, can take an alternative $PBS_NODEFILE input, so a sort -u $PBS_NODEFILE could come in handy here as well.
For that matter, reformatting the $PBS_NODEFILE to suit something like pdsh could be done as well:
jnewman@node01:~> cat $PBS_NODEFILE
node01.hydra
node01.hydra
node01.hydra
node01.hydra
node02.hydra
node02.hydra
node02.hydra
node02.hydra
node03.hydra
node03.hydra
node03.hydra
node03.hydra
jnewman@node01:~> pdsh -w $(sort -u $PBS_NODEFILE | sed ':a;$!N;/.\n./s/\n/,/;ta;/^[^\n]/P;D') echo Hello World!
node03: Hello World!
node01: Hello World!
node02: Hello World!
-
Joshua Newman (Altair) said:
Yes, this makes sense if you're using pbsdsh, as it starts a task for every line in $PBS_NODEFILE. You would need to use an alternative to be more selective about the number of tasks you run.
If you wanted to stick to the Task Manager, you could use pbs_tmrsh <node> <cmd> <args>, which uses the same syntax as the old rsh command, but runs over PBS rather than over rshd. You are restricted to nodes assigned to the job though.
Tasks are also commonly started using ssh, and a $PBS_NODEFILE run through sort -u will give you a single line for each host you've been assigned.
Many MPIs can also use SSH as well as the PBS Task Manager API. Most, if not all, can take an alternative $PBS_NODEFILE input, so a sort -u $PBS_NODEFILE could come in handy here as well.
For that matter, reformatting the $PBS_NODEFILE to suit something like pdsh could be done as well:
jnewman@node01:~> cat $PBS_NODEFILE
node01.hydra
node01.hydra
node01.hydra
node01.hydra
node02.hydra
node02.hydra
node02.hydra
node02.hydra
node03.hydra
node03.hydra
node03.hydra
node03.hydra
jnewman@node01:~> pdsh -w $(sort -u $PBS_NODEFILE | sed ':a;$!N;/.\n./s/\n/,/;ta;/^[^\n]/P;D') echo Hello World!
node03: Hello World!
node01: Hello World!
node02: Hello World!
Thank you, Joshua. I'm not sure I quite understand what you've suggested, though. The man page for "pbs_tmrsh" says:
The program is intended to be used during MPI integration activities, and not by end-users.
We don't want to use MPI to launch the job, because we don't know if the customer will have MPI installed or, if they do have it installed, where it is installed or what flavor of MPI they have. We would rather stick with "pbsdsh", if possible, which ships with PBS Pro.
Is there a way that I can continue to use "pbsdsh", but modify the $PBS_NODEFILE prior to its execution, such that duplicate nodes are removed and it only has one node per task?
Currently, when I run:
qsub -q rig_test_gpu -l "select=8:ngpus=1" -- /opt/pbs/bin/pbsdsh -- bash -c 'echo "$(hostname);CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
The $PBS_NODEFILE has this content:
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
If I can "sort $PBS_NODEFILE | uniq" prior to executing "pbsdsh", then it would contain:
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
and only two tasks, one on each node, would hopefully get executed.
Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
-
Rigoberto_20495 said:
Thank you, Joshua. I'm not sure I quite understand what you've suggested, though. The man page for "pbs_tmrsh" says:
The program is intended to be used during MPI integration activities, and not by end-users.
We don't want to use MPI to launch the job, because we don't know if the customer will have MPI installed or, if they do have it installed, where it is installed or what flavor of MPI they have. We would rather stick with "pbsdsh", if possible, which ships with PBS Pro.
Is there a way that I can continue to use "pbsdsh", but modify the $PBS_NODEFILE prior to its execution, such that duplicate nodes are removed and it only has one node per task?
Currently, when I run:
qsub -q rig_test_gpu -l "select=8:ngpus=1" -- /opt/pbs/bin/pbsdsh -- bash -c 'echo "$(hostname);CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
The $PBS_NODEFILE has this content:
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
node002.head.cm.us.cray.com
If I can "sort $PBS_NODEFILE | uniq" prior to executing "pbsdsh", then it would contain:
node010.head.cm.us.cray.com
node002.head.cm.us.cray.com
and only two tasks, one on each node, would hopefully get executed.
Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
> The program is intended to be used during MPI integration activities, and not by end-users.
This is the primary use case, but it can indeed be called directly and function properly. There is an example in the PBS Pro Admin Guide 2022.2 section 8.5.8.3 "Example Job" where pbs_tmrsh is directly called within the job script.
> Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
Pre-execution scripts can be created using hooks such as execjob_begin or execjob_launch, though from my tests, modifying $PBS_NODEFILE has no effect on pbsdsh. It appears the documentation should be clarified about the $PBS_NODEFILE, as it is apparently not used as the source node list for pbsdsh. I will create a ticket for that.
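For reference, an execution hook of that kind would be created roughly like this (the hook name and Python file are hypothetical, and as noted above it will not change which nodes pbsdsh targets):
qmgr -c "create hook fix_nodefile"
qmgr -c "set hook fix_nodefile event = execjob_launch"
qmgr -c "import hook fix_nodefile application/x-python default fix_nodefile.py"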
My recommendation would be to use pbs_tmrsh, ssh or pdsh.
Thanks!
Joshua
-
Joshua Newman (Altair) said:
> The program is intended to be used during MPI integration activities, and not by end-users
This is the primary use case, but it can indeed be called directly and function properly. There is an example in the PBS Pro Admin Guide 2022.2 section 8.5.8.3 "Example Job" where pbs_tmrsh is directly called within the job script.
> Is there any way to add some kind of pre-execution script that qsub can call to modify the $PBS_NODEFILE prior to it calling pbsdsh?
Pre-execution scripts can be created using hooks such as execjob_begin or execjob_launch, though from my tests, modifying $PBS_NODEFILE has no effect on pbsdsh. It appears the documentation should be clarified about the $PBS_NODEFILE, as it is apparently not used as the source node list for pbsdsh. I will create a ticket for that.
My recommendation would be to use pbs_tmrsh, ssh or pdsh.
Thanks!
Joshua
Thank you so much, Joshua. This appears to be doing what we need.
#PBS -l select="6:ngpus=1"

# Show the allocated nodes in the ${PBS_NODEFILE}.
echo "START PBS_NODEFILE=${PBS_NODEFILE}"
cat $PBS_NODEFILE
echo END PBS_NODEFILE

# Remove the duplicate nodes to avoid running multiple tasks on each node,
# such that only one task runs on each node.
for host in $(sort "${PBS_NODEFILE}" | uniq)
do
    pbs_tmrsh $host /bin/bash -c 'echo "hostname=$(hostname) : PBS_JOBID=${PBS_JOBID} : PBS_TASKNUM=${PBS_TASKNUM} : PBS_NODENUM=${PBS_NODENUM} : CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'
done
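As a small aside, the "sort | uniq" pipeline could likely be shortened to "sort -u ${PBS_NODEFILE}", matching the earlier suggestion; the result should be identical.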
The output looks correct. We have two 4-GPU nodes, "node002" and "node010". We requested 6 GPUs, and, as desired, 1 task was run per node, where the task on "node010" was allocated 4 GPUs and the task on "node002" was allocated 2 GPUs.
START PBS_NODEFILE=/var/spool/pbs/aux/657.node003
node010
node010
node010
node010
node002
node002
END PBS_NODEFILE
hostname=node002 : PBS_JOBID=657.node003 : PBS_TASKNUM=20000001 : PBS_NODENUM=1 : CUDA_VISIBLE_DEVICES=GPU-8536a49c-3b4c-c98c-7f14-b0a7c7755d64,GPU-df5fc6c8-540a-ed75-f907-80647319fb3b
hostname=node010 : PBS_JOBID=657.node003 : PBS_TASKNUM=00000002 : PBS_NODENUM=0 : CUDA_VISIBLE_DEVICES=GPU-bca9a972-41ae-78eb-c159-e1969b434ebe,GPU-4bb029f5-0fc1-5c67-36c8-d4f9eba6a747,GPU-f21641b2-5ec1-8f75-9bd2-3de108be190d,GPU-c9c1468c-d47d-b1d2-f00a-a9267a6bd1a5
Thank you once again.