Hi, I recently upgraded to AGE 2023.1.0 (8.8.0) for its MIG support, to divide up 3 x 40 GB NVIDIA A100 GPUs on a cluster.
I successfully partitioned each card into seven 5 GB MIG devices and assigned them to a single consumable resource, gpu; the partitioning sketch and the relevant configuration follow.
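The slicing itself was done with nvidia-smi, roughly along these lines (profile ID 19 is the 1g.5gb profile on a 40 GB A100; the -i index is repeated for each of the three cards):

# enable MIG mode on the card (a GPU reset may be needed afterwards)
nvidia-smi -i 0 -mig 1
# create seven 1g.5gb GPU instances plus their default compute instances
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
# list the resulting MIG UUIDs used in the host configuration below
nvidia-smi -L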
qconf -mc
#name   shortcut  type   relop  requestable  consumable  default  urgency   aapre  affinity  do_report  is_static
gpu     gpu       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
qconf -me <gpuhost>
complex_values gpu=21(A100[cuda_id=0,device=/dev/nvidia0, \
uuid=MIG-04026c5c-e6c1-592d-8041-f512fc883c15] \
A100[cuda_id=1,device=/dev/nvidia1, \
uuid=MIG-67084d7e-6bea-54a7-830b-378ab8dad274] \
A100[cuda_id=2,device=/dev/nvidia2, \
uuid=MIG-b1e67652-6a66-5b5a-8e3f-30bc26512d15] \
... and so on, for all 21 MIG devices.
Users request GPUs with qsub -l gpu=1 (or however many they need), AGE reserves the allocated MIG ID for the job via cgroups, and the job runs. However, one size doesn't suit all: the computational chemists running FEP are happy with many 5 GB devices, but the computational biologists would like fewer devices with more memory. Can anyone suggest a sensible, and perhaps dynamic, way of assigning different-sized GPU slices, other than defining separate complexes and using arbitrary resources such as gpu.5g, gpu.10g and gpu.40g?
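For concreteness, the static alternative I would like to avoid looks roughly like this (the gpu.5g / gpu.10g / gpu.40g names and the per-card split are only illustrative):

qconf -mc
#name     shortcut  type   relop  requestable  consumable  default  urgency   aapre  affinity  do_report  is_static
gpu.5g    g5        RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
gpu.10g   g10       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
gpu.40g   g40       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO

qconf -me <gpuhost>
complex_values gpu.5g=7(...),gpu.10g=3(...),gpu.40g=1(...)

Jobs would then pick a size explicitly, e.g. qsub -l gpu.10g=1, and the cards would have to be re-partitioned by hand whenever the mix of profiles needs to change, which is exactly the inflexibility I am hoping to avoid.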
