Parallel jobs in HPC

Unknown
edited March 2023 in Community Q&A

Hi all,

I have a compute node with 128 cores. I want to run two AcuSolve simulations with 64 cores each on a single node. I use Slurm for job submission. The command I use is:

srun -n 1 -N 1 -c 64 acuRun -np 64 -libs ./libusr.so

Even after changing the -n and -N flags above, I was not able to run two different AcuSolve simulations on the cluster: only 64 cores were used and the rest sat idle. I was also not able to submit a second job to the same compute node once one job had already been submitted there.

Is this limitation coming from the AcuSolve side or from the cluster?

Do you have any suggestions for running multiple AcuSolve simulations on a single compute node?
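
For reference, a minimal sketch of the kind of batch script I would use for one of the two jobs (the job name is a placeholder; as far as I understand, two such jobs can only share a node if the partition allows node sharing):

#!/bin/bash
#SBATCH --job-name=acusolve_64c
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64

srun -n 1 -c 64 acuRun -np 64 -libs ./libusr.so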

 

Thanks.

Best Answer

  • acupro
    acupro
    Altair Employee
    edited March 2023 Answer ✓

    By default, the solver looks for the input file to match the problem name.  So if you indicate

    -pb acupro_1

    it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file

    acuPrep: *** ERROR: Error opening file <acupro_1.inp>
    acuPrep: *** ERROR: No RUN command found

    and the job dies.

    You can add the flag to look for the appropriate/desired input file as well, for example:

    -pb acupro_1 -inp input_1.inp

    Or however you want to point to the correct input file with your script.

Answers

  • acupro
    acupro
    Altair Employee
    edited February 2023

    I've just tested running two jobs on our internal machine, and Altair PBSPro does appear to have core-level resource handling, as it submitted two jobs to the same compute node.  AcuSolve itself does not have that ability - you'll need to rely on your job control software.

    I would assume that Slurm controls which resources are assigned to each job, so this falls to Slurm. You'll need to check whether your Slurm setup has core-level resource handling.
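
    As a rough sketch, you could check how the Slurm side is configured with standard scontrol queries like the following (the exact fields reported depend on the Slurm version and site setup):

    # Core-level scheduling usually means SelectType=select/cons_res or
    # select/cons_tres, with CR_Core or CR_CPU in SelectTypeParameters.
    scontrol show config | grep -i SelectType

    # The partition must also allow jobs to share nodes (OverSubscribe / Shared).
    scontrol show partition | grep -i -E 'PartitionName|OverSubscribe|Shared'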

  • Unknown
    edited February 2023

    Thank you, I will try again.

  • Unknown
    edited March 2023

    Hi acupro,

    I tried running jobs in parallel from the same directory on my local machine, and it works as expected. But when I move to the cluster, it does not work.

    I tried submitting the jobs in parallel on the cluster.

    The submission script is:

    #####################################

    #!/bin/bash

    #SBATCH --job-name="acupro_concurrent_test"
    #SBATCH -o acupro_concurrent_test_%j.out
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=36
    #SBATCH --exclusive

     

    for i in {1..3}
    do
        srun acuRun -np 12 -pb acupro_$i -libs ./libusr.so &
    done
    wait

    #####################################

    I get the following error:

    #####################################

    acuRun: Log is redirected to: acupro_2.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:34 2023
    srun: error: cn333: task 0: Exited with exit code 1
    srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: Step created for job 1910962
    acuRun: Log is redirected to: acupro_3.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:42 2023
    srun: error: cn333: task 0: Exited with exit code 1
    srun: Job 1910962 step creation still disabled, retrying (Requested nodes are busy)
    srun: Step created for job 1910962
    acuRun: Log is redirected to: acupro_1.1.Log
    acuRun: *** ERROR: error occurred executing acuPrep
    acuRun: Thu Mar 30 15:58:51 2023
    srun: error: cn333: task 0: Exited with exit code 1

    #####################################

     

    Changing --cpus-per-task, --ntasks, etc. does not get the jobs to run.

    Do you have any ideas on this?

     

    Thanks and regards,

    1.log 20.5K
    2.log 20.5K
    3.log 20.5K
  • acupro
    acupro
    Altair Employee
    edited March 2023 Answer ✓

    By default, the solver looks for the input file to match the problem name.  So if you indicate

    -pb acupro_1

    it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file

    acuPrep: *** ERROR: Error opening file <acupro_1.inp>
    acuPrep: *** ERROR: No RUN command found

    and the job dies.

    You can add the flag to look for the appropriate/desired input file as well, for example:

    -pb acupro_1 -inp input_1.inp

    Or however you want to point to the correct input file with your script.
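
    As a sketch, the for loop in the submission script above could then become something like this (the input_*.inp names are placeholders for whatever the real input files are called):

    for i in {1..3}
    do
        srun acuRun -np 12 -pb acupro_$i -inp input_$i.inp -libs ./libusr.so &
    done
    wait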

  • Unknown
    edited March 2023

    Thank you, it is working now (although I had to make some modifications).
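
    For anyone who finds this later, a hedged sketch of one way to combine the input-file fix with per-step resource requests, so the three srun steps run side by side instead of queueing behind each other. The --exact flag requires a reasonably recent Slurm release (on older versions, the step-level --exclusive flag plays a similar role), and the input file names are placeholders:

    #####################################

    #!/bin/bash
    #SBATCH --job-name="acupro_concurrent_test"
    #SBATCH -o acupro_concurrent_test_%j.out
    #SBATCH --nodes=1
    #SBATCH --ntasks=3
    #SBATCH --cpus-per-task=12
    #SBATCH --exclusive

    for i in {1..3}
    do
        # Each step asks for exactly 1 task and 12 CPUs, so three steps fit
        # inside the 36-CPU allocation and can start concurrently.
        srun --exact -n 1 -c 12 acuRun -np 12 -pb acupro_$i -inp input_$i.inp -libs ./libusr.so &
    done
    wait

    #####################################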