Hi, I recently upgraded to AGE 2023.1.0 (8.8.0) for its MIG support, to divide up 3 x 40 GB NVIDIA A100 GPUs on a cluster.
I successfully partitioned each card into seven 5 GB MIG devices and assigned them to a single consumable resource, gpu; the partitioning sketch and the relevant configuration follow.
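The slicing itself was done with nvidia-smi, roughly along these lines (profile ID 19 is the 1g.5gb profile on a 40 GB A100; the -i index is repeated for each of the three cards):

# enable MIG mode on the card (a GPU reset may be needed afterwards)
nvidia-smi -i 0 -mig 1
# create seven 1g.5gb GPU instances plus their default compute instances
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
# list the resulting MIG UUIDs used in the host configuration below
nvidia-smi -L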
qconf -mc
#name   shortcut  type   relop  requestable  consumable  default  urgency   aapre  affinity  do_report  is_static
gpu     gpu       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
qconf -me <gpuhost>
complex_values gpu=21(A100[cuda_id=0,device=/dev/nvidia0, \
uuid=MIG-04026c5c-e6c1-592d-8041-f512fc883c15] \
A100[cuda_id=1,device=/dev/nvidia1, \
uuid=MIG-67084d7e-6bea-54a7-830b-378ab8dad274] \
A100[cuda_id=2,device=/dev/nvidia2, \
uuid=MIG-b1e67652-6a66-5b5a-8e3f-30bc26512d15] \
... and so on, for all 21 MIG devices.
Users request GPUs with qsub -l gpu=1 (or however many they need), AGE reserves the allocated MIG ID for the job via cgroups, and the job runs. However, one size doesn't suit all: the computational chemists running FEP are happy with many 5 GB devices, but the computational biologists would like fewer devices with more memory. Can anyone suggest a sensible, and perhaps dynamic, way of assigning different-sized GPU slices, other than defining separate complexes and using arbitrary resources such as gpu.5g, gpu.10g and gpu.40g?
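For concreteness, the static alternative I would like to avoid looks roughly like this (the gpu.5g / gpu.10g / gpu.40g names and the per-card split are only illustrative):

qconf -mc
#name     shortcut  type   relop  requestable  consumable  default  urgency   aapre  affinity  do_report  is_static
gpu.5g    g5        RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
gpu.10g   g10       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO
gpu.40g   g40       RSMAP  <=     YES          HOST        0        0         YES    0.000000  YES        NO

qconf -me <gpuhost>
complex_values gpu.5g=7(...),gpu.10g=3(...),gpu.40g=1(...)

Jobs would then pick a size explicitly, e.g. qsub -l gpu.10g=1, and the cards would have to be re-partitioned by hand whenever the mix of profiles needs to change, which is exactly the inflexibility I am hoping to avoid.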
