How to select an appropriately sized resource for MIG GPUs
Hi, I recently upgraded to AGE 2023.1.0 (8.8.0) for its MIG support, in order to divide up three 40 GB NVIDIA A100 GPUs on a cluster.
I successfully partitioned each card into 7 x 5g MIG devices and assigned them as a resource named gpu:
qconf -mc
#name  shortcut  type   relop  requestable  consumable  default  urgency  aapre  affinity  do_report  is_static
gpu    gpu       RSMAP  <=     YES          HOST        0        0        YES    0.000000  YES        NO
qconf -me <gpuhost>
complex_values gpu=21(A100[cuda_id=0,device=/dev/nvidia0, \
uuid=MIG-04026c5c-e6c1-592d-8041-f512fc883c15] \
A100[cuda_id=1,device=/dev/nvidia1, \
uuid=MIG-67084d7e-6bea-54a7-830b-378ab8dad274] \
A100[cuda_id=2,device=/dev/nvidia2, \
uuid=MIG-b1e67652-6a66-5b5a-8e3f-30bc26512d15] \
and so on.... for all 21
Users request with qsub -l gpu=1, or however many they need; AGE sets a cgroup for the reserved MIG ID and the job runs. However, it looks like one size doesn't suit all: computational chemists doing FEP are happy with many 5g devices, but computational biologists would like fewer devices with more memory. Can anyone suggest a sensible, and perhaps dynamic, way of assigning different-sized GPUs, other than specifying different complex values and using arbitrary resources such as gpu.5g, gpu.10g, gpu.40g?
Answers
-
Hi Ian,
this is Andy from Altair from the Grid Engine team.
There are a couple of complex considerations when it comes to MIG reconfiguration. First, an A100, in contrast to an H100, cannot be switched from full GPU mode to MIG mode and vice versa without a complete reset of the device, which in turn only works when no processes are using the NVIDIA libraries. This would require stopping the Grid Engine execution daemon and DCGM. Normal MIG reconfiguration, however, can be done on A100s at runtime as long as they have no running processes.
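To illustrate what a runtime MIG reconfiguration on one card involves, here is a minimal sketch using NVIDIA's nvidia-smi mig subcommands. It is not the Grid Engine mechanism and not the scripting approach mentioned in this reply. The function name mig_reconfig_cmds is hypothetical, and the example profile mix (two 3g.20gb instances) is only an assumption; whether a given mix fits depends on NVIDIA's placement rules for the card. The script only prints the commands, so it can be reviewed before anything is run as root on a GPU with no running processes.

```shell
#!/bin/sh
# Sketch only: print the nvidia-smi commands that would repartition one
# MIG-enabled GPU. Nothing is executed against the GPU here.
# Assumptions: the GPU is already in MIG mode, is idle, and the scheduler
# is not dispatching jobs to it while the layout changes.

mig_reconfig_cmds() {
    gpu="$1"        # GPU index as shown by nvidia-smi, e.g. 0
    profiles="$2"   # comma-separated GPU instance profiles, e.g. 3g.20gb,3g.20gb

    # Destroy existing compute instances first, then the GPU instances.
    echo "nvidia-smi mig -i $gpu -dci"
    echo "nvidia-smi mig -i $gpu -dgi"

    # Create the new GPU instances; -C also creates the default
    # compute instance inside each of them.
    echo "nvidia-smi mig -i $gpu -cgi $profiles -C"
}

# Example: plan two larger instances on GPU 0.
mig_reconfig_cmds 0 3g.20gb,3g.20gb
```

After a change like this the MIG UUIDs differ, so the uuid entries in the host's complex_values would no longer match and would have to be updated as well.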
Dynamic MIG reconfiguration is not yet supported out of the box by Grid Engine; it is an enhancement we are considering adding to Altair's workload management systems in a future release. We have a rough sketch of how this could be achieved today with some scripting. I suggest you open a support ticket so that we can discuss it with you. Please mention my name so that the ticket is forwarded to me.
Thanks!
Andy Schwierskott
Product Manager
Altair | Nasdaq: ALTR