Parallel jobs in HPC

Unknown
edited March 2023 in Community Q&A

Hi all,

I have a compute node with 128 cores. I want to run two AcuSolve simulations with 64 cores each on a single node. I use Slurm for job submission. The command I use is:

srun -n 1 -N 1 -c 64 acuRun -np 64 -libs ./libusr.so

Even after changing the -n and -N flags above, I was not able to run two different AcuSolve simulations on the cluster: only 64 cores were used and the rest sat idle. I was also not able to submit a second job to the same compute node once one job had already been submitted there.

Is this limitation coming from the AcuSolve side or from the cluster?

Do you have any suggestions for running multiple AcuSolve simulations on a single compute node?
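
For reference, a minimal sketch of the kind of batch script I would use for one of the two jobs (the job name is a placeholder; as far as I understand, two such jobs can only share a node if the partition allows node sharing):

#!/bin/bash
#SBATCH --job-name=acusolve_64c
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64

srun -n 1 -c 64 acuRun -np 64 -libs ./libusr.so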

 

Thanks.

Best Answer

  • acupro
    acupro
    Altair Employee
    edited March 2023 Answer ✓

    By default, the solver looks for the input file to match the problem name.  So if you indicate

    -pb acupro_1

    it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file

    acuPrep: *** ERROR: Error opening file <acupro_1.inp>
    acuPrep: *** ERROR: No RUN command found

    and the job dies.

    You can add the flag to look for the appropriate/desired input file as well, for example:

    -pb acupro_1 -inp input_1.inp

    Or however you want to point to the correct input file with your script.

Answers

  • acupro
    acupro
    Altair Employee
    edited February 2023

    I've just tested running two jobs on our internal machine, and Altair PBSPro does appear to have core-level resource handling, as it submitted two jobs to the same compute node.  AcuSolve itself does not have that ability - you'll need to rely on your job control software.

    I would assume that Slurm controls which resources are assigned to each job, so this falls to Slurm. You'll need to check whether your Slurm setup has core-level resource handling.
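
    As a rough sketch, you could check how the Slurm side is configured with standard scontrol queries like the following (the exact fields reported depend on the Slurm version and site setup):

    # Core-level scheduling usually means SelectType=select/cons_res or
    # select/cons_tres, with CR_Core or CR_CPU in SelectTypeParameters.
    scontrol show config | grep -i SelectType

    # The partition must also allow jobs to share nodes (OverSubscribe / Shared).
    scontrol show partition | grep -i -E 'PartitionName|OverSubscribe|Shared'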

  • Unknown
    edited February 2023

    Thank you, I will try again.

  • Unknown
    edited March 2023

    Hi acupro,

    I tried running jobs in parallel from the same directory on my local machine, and it works as expected. But when I move to the cluster, it does not work.

    I tried submitting the jobs in parallel on the cluster.

    The submission script is:

    #####################################

    #!/bin/bash

    #SBATCH --job-name="acupro_concurrent_test"
    #SBATCH -o acupro_concurrent_test_%j.out
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=36
    #SBATCH --exclusive

     

    for i in {1..3}
    do
        srun acuRun -np 12 -pb acupro_$i -libs ./libusr.so &
    done
    wait

    #####################################

    I get the following error:

    #####################################

    acuRun: Log is redirected to: acupro_2.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:34 2023
    srun: error: cn333: task 0: Exited with exit code 1
    srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: Step created for job 1910962
    acuRun: Log is redirected to: acupro_3.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:42 2023
    srun: error: cn333: task 0: Exited with exit code 1
    srun: Job 1910962 step creation still disabled, retrying (Requested nodes are busy)
    srun: Step created for job 1910962
    acuRun: Log is redirected to: acupro_1.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:51 2023
    srun: error: cn333: task 0: Exited with exit code 1

    #####################################

     

    Changing --cpus-per-task, --ntasks, etc. does not get the jobs to run.

    Do you have any ideas on this?

     

    Thanks and regards,

    1.log 20.5K
    2.log 20.5K
    3.log 20.5K
  • acupro
    acupro
    Altair Employee
    edited March 2023 Answer ✓

    By default, the solver looks for the input file to match the problem name.  So if you indicate

    -pb acupro_1

    it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file

    acuPrep: *** ERROR: Error opening file <acupro_1.inp>
    acuPrep: *** ERROR: No RUN command found

    and the job dies.

    You can add the flag to look for the appropriate/desired input file as well, for example:

    -pb acupro_1 -inp input_1.inp

    Or however you want to point to the correct input file with your script.
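
    As a sketch, the for loop in the submission script above could then become something like this (the input_*.inp names are placeholders for whatever the real input files are called):

    for i in {1..3}
    do
        srun acuRun -np 12 -pb acupro_$i -inp input_$i.inp -libs ./libusr.so &
    done
    wait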

  • Unknown
    edited March 2023

    Thank you, it is working now (although I had to make some modifications).
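
    For anyone who finds this later, a hedged sketch of one way to combine the input-file fix with per-step resource requests, so the three srun steps run side by side instead of queueing behind each other. The --exact flag requires a reasonably recent Slurm release (on older versions, the step-level --exclusive flag plays a similar role), and the input file names are placeholders:

    #####################################

    #!/bin/bash
    #SBATCH --job-name="acupro_concurrent_test"
    #SBATCH -o acupro_concurrent_test_%j.out
    #SBATCH --nodes=1
    #SBATCH --ntasks=3
    #SBATCH --cpus-per-task=12
    #SBATCH --exclusive

    for i in {1..3}
    do
        # Each step asks for exactly 1 task and 12 CPUs, so three steps fit
        # inside the 36-CPU allocation and can start concurrently.
        srun --exact -n 1 -c 12 acuRun -np 12 -pb acupro_$i -inp input_$i.inp -libs ./libusr.so &
    done
    wait

    #####################################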