How to divide and select appropriate resources for MIG GPUs

Ian Hayhurst New Altair Community Member
edited October 2023 in Community Q&A

Hi, I recently upgraded to AGE 2023.1.0 (8.8.0) for its MIG support, so that I could divide up 3 x 40 GB Nvidia A100 GPUs on a cluster.
I successfully partitioned each card into 7 x 5g MIG devices and assigned them as a resource gpu:

qconf -mc

#name                      shortcut                   type        relop    requestable    consumable    default    urgency    aapre    affinity    do_report    is_static
gpu                        gpu                        RSMAP       <=       YES            HOST          0          0          YES      0.000000    YES          NO

qconf -me <gpuhost>
complex_values        gpu=21(A100[cuda_id=0,device=/dev/nvidia0, \
                      uuid=MIG-04026c5c-e6c1-592d-8041-f512fc883c15] \
                      A100[cuda_id=1,device=/dev/nvidia1, \
                      uuid=MIG-67084d7e-6bea-54a7-830b-378ab8dad274] \
                      A100[cuda_id=2,device=/dev/nvidia2, \
                      uuid=MIG-b1e67652-6a66-5b5a-8e3f-30bc26512d15] \

and so on for all 21 devices.


Users request GPUs with qsub -l gpu=1 (or however many they need), AGE sets up a cgroup with the reserved MIG ID, and the job runs. However, it looks like one size doesn't suit all: computational chemists doing FEP are happy with many 5g devices, but computational biologists would like fewer devices with more memory. Can anyone suggest a sensible, and perhaps dynamic, way of assigning different-sized GPUs, other than defining separate complex values and requesting an arbitrary resource such as gpu.5g, gpu.10g or gpu.40g?
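To make the static alternative concrete, here is a minimal sketch of what the submit side might look like. It assumes that per-size complexes named gpu.5g, gpu.10g and gpu.40g (the names floated above; they do not exist in the configuration shown) have already been defined with qconf -mc and populated on the host with qconf -me. The helper name mig_resource_for and the job script name job.sh are hypothetical, and the sizes are nominal profile sizes.

```shell
#!/bin/sh
# Hypothetical helper: map a requested amount of GPU memory (in GB) to one of
# the per-size complexes gpu.5g / gpu.10g / gpu.40g. These complex names are
# assumptions taken from the question, not an existing AGE configuration.
mig_resource_for() {
    mem_gb=$1
    if [ "$mem_gb" -le 5 ]; then
        echo "gpu.5g"
    elif [ "$mem_gb" -le 10 ]; then
        echo "gpu.10g"
    elif [ "$mem_gb" -le 40 ]; then
        echo "gpu.40g"
    else
        echo "no MIG device with ${mem_gb} GB on a 40 GB A100" >&2
        return 1
    fi
}

# Print (rather than run) the resulting request for a job needing about 8 GB.
echo "qsub -l $(mig_resource_for 8)=1 job.sh"
```

The drawback is the one already noted: the split between sizes is fixed when the cards are partitioned, so idle 5g devices cannot serve a job that needs 40 GB.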

 


Answers

  • Andy Schwierskott_22306
    Altair Employee
    edited October 2023

    Hi Ian,

    This is Andy from the Grid Engine team at Altair.

    There are a couple of complex considerations when it comes to MIG reconfiguration. First, an A100, in contrast to an H100, cannot be switched from full GPU mode to MIG mode (or vice versa) without a complete reset of the device, which in turn only works when no processes are using the Nvidia libraries. This would require stopping the Grid Engine execution daemon and DCGM. Normal MIG reconfiguration, however, can be done on A100s at runtime as long as they have no running processes.
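    For illustration only, runtime re-partitioning of a single idle card with nvidia-smi might be sketched as follows. This is a generic sketch under stated assumptions, not a Grid Engine feature or Altair's method: it assumes MIG mode is already enabled on the GPU, that no job is using it, and that it currently holds instances to destroy. The function names run and remig and the DRY_RUN switch are hypothetical. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: replace the MIG layout of one idle A100 with a new set of profiles.
# DRY_RUN=1 prints the nvidia-smi commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remig GPU_INDEX PROFILE_LIST
# Example: remig 0 3g.20gb,2g.10gb,1g.5gb,1g.5gb
remig() {
    gpu=$1
    profiles=$2
    # Destroy the existing compute instances, then the GPU instances.
    # Both steps fail if a process is still using the device.
    run nvidia-smi mig -i "$gpu" -dci || return 1
    run nvidia-smi mig -i "$gpu" -dgi || return 1
    # Create the new GPU instances and their default compute instances (-C).
    run nvidia-smi mig -i "$gpu" -cgi "$profiles" -C || return 1
}

remig 0 3g.20gb,2g.10gb,1g.5gb,1g.5gb
```

    After such a change the number and size of the MIG devices differ from before, so the gpu entries in the host's complex_values would presumably have to be updated to match the new layout.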

    Dynamic MIG reconfiguration is not yet supported out of the box by Grid Engine; it is an enhancement we are considering adding to Altair's workload management systems in a future release. We have a rough sketch of how this could be achieved today with some scripting. I suggest you open a support ticket so that we can discuss it with you. Please mention my name so that the ticket is forwarded to me.

    Thanks!

    Andy Schwierskott

    Product Manager
    Altair | Nasdaq: ALTR