Solver options -nt -np

Alberto Casas Altair Community Member
edited April 2023 in Community Q&A

Hello all,

 

I have a computer with 6 CPU cores and 2 threads per core, and I was wondering how to get the full potential of the machine while simulating. I have read about the -nt and -np options, but I am still trying to understand them. In my case, to get the maximum performance, should I use -nt 6 -np 2? I would really appreciate any help, as I am a bit lost on this topic.

Thank you in advance!

Alberto 


Best Answer

  • PaulAltair
    Altair Employee
    edited April 2023 Answer ✓

    Hi, it depends to a degree on the size of your model (number of elements).

    -np is the number of MPI (SPMD) processes; -nt is the number of threads per process.

    If your model is 'large', then -np 12 (-nt 1) will likely give the best performance.

    A good rule of thumb is that each process should have a minimum of 5000-10000 elements. So if your model is 60k elements or larger, you can run -np 12.

    If the model is smaller (e.g. 30k elements), then you can try -np 6 -nt 2 (6 processes with 2 threads each).

    For small models (e.g. 1000-10000 elements), -nt 1 or -nt 2 (with no -np set) is probably faster than decomposing the model into domains.

    A final point: are you sure you have 2 threads per core? You aren't talking about hyperthreading? Logical (hyperthreaded) processors won't help performance much at all. If you only have 6 cores (12 logical processors), then you only really have 6 CPUs available, and -np 6 would give performance as good as or better than -np 12.
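
    The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name `suggest_np_nt`, the default threshold of 5000 elements per process, and the cap of 2 threads for undecomposed small models are assumptions taken from the numbers quoted in this answer, not anything built into the solver.

```python
def suggest_np_nt(n_elements, n_cpus, min_elements_per_process=5000):
    """Suggest an (np, nt) pair from the rule of thumb; illustrative only."""
    # How many MPI domains the model can support at the minimum domain size,
    # capped at the number of CPUs available.
    np_ = max(1, min(n_cpus, n_elements // min_elements_per_process))
    if np_ == 1:
        # Small model: skip domain decomposition and use at most 2 threads.
        return 1, min(2, n_cpus)
    # Hand any leftover CPUs to threads inside each process.
    nt = max(1, n_cpus // np_)
    return np_, nt


print(suggest_np_nt(60000, 12))  # (12, 1)
print(suggest_np_nt(30000, 12))  # (6, 2)
print(suggest_np_nt(8000, 12))   # (1, 2)
```

    Given the hyperthreading caveat, `n_cpus` is probably best set to the number of physical cores (6 on the machine in question) rather than the number of logical processors.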

     

Answers


  • Alberto Casas
    Alberto Casas Altair Community Member
    edited March 2023

    Hi Paul, thank you so much for the answer. I have some follow-up questions, and I would be very grateful if you could help me.

    Firstly, how does the time step come into the choice of -np? Regardless of the number of cores available, does the choice between -np 3 and -np 6 depend only on the number of elements? If I have to reduce the time step and I decide to use more cores, should the calculation time be reduced?

    Regarding the 2 threads per core, I do indeed mean hyperthreading, so in total I have 12 logical processors. With 6 cores and 12 logical processors, is there any difference between using -np 6 -nt 2 and -np 12? I understood that the number of threads per core should go in -nt, not in -np. Is that right?

    Finally, what is the difference between MPI processes and SPMD domains? I ran a simulation with -np 6 -nt 2, and the following text was included in the output file:

    image

    As a final conclusion, I understand that the fastest option is not always to use all the cores available, is it?

    Thank you so much again! I really appreciate your help! :)

  • PaulAltair
    Altair Employee
    edited March 2023

    OK, some more general info:

    MPI processes = SPMD domains = -np

    threads per domain = -nt

    total number of CPUs used = nt * np

    So if you use -np 12 -nt 1, you will get 12 processes running on 1 thread each, and -np 6 -nt 2 gives 6 processes running on 2 threads each, for a total of 12 in both cases.
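
    To make the nt * np arithmetic concrete, here is a small Python sketch that lists every (-np, -nt) pair using exactly a given number of CPUs. The function name `np_nt_combinations` is made up for this illustration.

```python
def np_nt_combinations(total_cpus):
    """List all (np, nt) pairs whose product equals total_cpus."""
    return [
        (np_, total_cpus // np_)
        for np_ in range(1, total_cpus + 1)
        if total_cpus % np_ == 0  # keep only exact divisors
    ]


print(np_nt_combinations(6))   # [(1, 6), (2, 3), (3, 2), (6, 1)]
print(np_nt_combinations(12))  # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```

    The first and last entries of each list correspond to running with threads only (-np 1) and with processes only (-nt 1); everything in between is a hybrid split.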

    Normally you would just use 'SMP' (-np 1 -nt 6) or 'pure MPI' (-np 6 -nt 1). Using the hybrid approach (e.g. -np 3 -nt 2) probably isn't worth it on your machine; it can be useful for models on larger machines with 100s of CPUs.

    The -nt value is just the number of threads each MPI process runs on, not specifically the number of threads per core your machine has. It shouldn't be more than the number of physical cores available on the CPU, but in your case that is 6, not 2.

    In your case, you effectively have 1 CPU with 6 physical cores (plus 6 extra hyperthreaded 'logical' processors), not 6 cores with 2 usable threads each. You really only have 6 'useful' CPUs for Radioss, and you will not see more than a fractional speedup from any configuration using 7-12 CPUs.

    Hyperthreading works by using clock cycles that desktop applications leave idle. When you are running Radioss, it is already maxing out the 6 cores, so there are no spare clock cycles for hyperthreading to use.
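
    If you are unsure how many physical cores a machine has, one way to check from Python is sketched below. `os.cpu_count()` reports logical processors; the physical count here relies on the third-party `psutil` package, which is an assumption about what is installed, so the sketch falls back to `None` when it is missing.

```python
import os


def count_cpus():
    """Return (logical, physical) CPU counts; physical is None if unknown."""
    logical = os.cpu_count() or 1
    try:
        import psutil  # third-party; may not be installed
    except ImportError:
        return logical, None
    # psutil can itself return None when the count cannot be determined.
    physical = psutil.cpu_count(logical=False)
    return logical, physical


logical, physical = count_cpus()
print(f"logical processors: {logical}, physical cores: {physical}")
```

    On a 6-core machine with hyperthreading enabled, this would be expected to report 12 logical processors and 6 physical cores; the physical figure is the one to compare against nt * np.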

    If you are also using your machine for other things while the job is running (web browser, office applications, email, etc.), then you may find that using all 6 for solving is not ideal either!

    If your model is very small (e.g. a single element, or a tensile test model with only 200-300 elements), then running on more than 1 CPU may actually slow the job down.

    There is no link between the time step and the number of cores used.

    If, for example, you have a 100000-element model with a time step of 1e-6 s and a termination time of 0.1 s, none of those things needs to change to run it on 1 CPU (-np 1 -nt 1) or on 6 (-np 6 -nt 1, or -np 1 -nt 6). The 6-CPU version should run significantly faster in either case, and I would expect -np 6 to be marginally quicker than -nt 6.
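
    The numbers in that example can be worked through in a short Python sketch. It assumes a constant time step for simplicity (in a real run the time step can vary from cycle to cycle), and both function names are hypothetical.

```python
def estimated_cycles(termination_time, time_step):
    """Approximate cycle count, assuming a constant time step."""
    return round(termination_time / time_step)


def elements_per_domain(n_elements, np_):
    """Average number of elements each MPI domain would receive."""
    return n_elements // np_


# The cycle count depends only on the model settings, not on -np or -nt.
print(estimated_cycles(0.1, 1e-6))     # 100000
# With -np 6, each domain is still well above the 5000-10000 element guideline.
print(elements_per_domain(100000, 6))  # 16666
```

    Adding CPUs therefore does not change how many cycles have to be computed; it only changes how quickly each cycle is completed.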

    The easiest way for you to get an understanding of it is to try running your model with a few combinations and see how the performance changes (e.g. try -nt 1, then -nt 4, then -np 4).
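
    A comparison like that could be scripted roughly as follows. This is a sketch under stated assumptions: the launcher name, the input file name, and the command layout `<launcher> <input file> -np N -nt M` are placeholders, so adapt `build_command` to however the solver is actually started on your system.

```python
import subprocess
import time


def build_command(launcher, deck, np_, nt):
    # Hypothetical command layout; adjust to your actual solver invocation.
    return [launcher, deck, "-np", str(np_), "-nt", str(nt)]


def time_configs(launcher, deck, configs, run=subprocess.run):
    """Run the job once per (np, nt) pair and record wall-clock seconds."""
    timings = {}
    for np_, nt in configs:
        cmd = build_command(launcher, deck, np_, nt)
        start = time.perf_counter()
        run(cmd, check=True)  # raises if the solver exits with an error
        timings[(np_, nt)] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Placeholder launcher and input file names, not real paths.
    results = time_configs("my_solver_launcher", "my_model_input",
                           [(1, 1), (1, 4), (4, 1)])
    for (np_, nt), seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"-np {np_} -nt {nt}: {seconds:.1f} s")
```

    Running each configuration on an otherwise idle machine, and on the same model, keeps the timings comparable.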

     

     

     

  • Alberto Casas
    Alberto Casas Altair Community Member
    edited April 2023

    Hello Paul,

    Sorry for the late answer but I really appreciate it! After some simulations, I believe I grasp the concept and difference between nt and np.

    Thank you so much!!