Parallel jobs in HPC

Hi all,
I have a compute node with 128 cores. I want to run two AcuSolve simulations with 64 cores each on that node. I use Slurm for job submission. The command I use is:
srun -n 1 -N 1 -c 64 acuRun -np 64 -libs ./
By changing the -n and -N flags above, I was NOT able to run two different AcuSolve simulations on the node at the same time. Only 64 cores were used and the rest stayed idle. I was also not able to submit a second job to the same compute node once a job had been submitted to that node.
Is this coming from the AcuSolve side or from the cluster?
Do you have any idea how to run multiple AcuSolve simulations on a single compute node?
Best Answer
By default, the solver looks for the input file to match the problem name. So if you indicate
-pb acupro_1
it will look for acupro_1.inp as the input file. When that file is not found, you see this error in the Log file and the job dies:
acuPrep: *** ERROR: Error opening file <acupro_1.inp>
acuPrep: *** ERROR: No RUN command found
You can add the -inp flag to point to the desired input file as well, for example:
-pb acupro_1 -inp input_1.inp
Or however you want to point to the correct input file with your script.
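To make the mapping between problem name and input file explicit in a launch script, a minimal sketch could look like the following. The helper name build_acurun_cmd and the input_<i>.inp naming are assumptions for illustration; only the -np, -pb, -inp and -libs flags come from the commands in this thread.

```shell
#!/bin/bash
# Hypothetical helper: print the acuRun command for case number $1,
# or fail early if the assumed input file input_<i>.inp is missing.
build_acurun_cmd() {
    local i="$1"
    local inp="input_${i}.inp"   # assumed naming; adjust to your files
    if [ ! -f "$inp" ]; then
        echo "missing input file: $inp" >&2
        return 1
    fi
    echo "acuRun -np 12 -pb acupro_${i} -inp ${inp} -libs ./"
}

# Dry run: only print the commands so the mapping can be checked first.
for i in 1 2 3; do
    build_acurun_cmd "$i" || echo "case $i skipped"
done
```

Checking for the file before launching surfaces a missing input in the job output rather than only in the acuPrep Log file.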
I've just tested running two jobs on our internal machine, and Altair PBSPro does appear to have core-level resource handling, as it submitted two jobs to the same compute node. AcuSolve itself does not have that ability, so you'll need to rely on your job control software.
I would assume that Slurm controls which resources are assigned to the job, so this would fall to Slurm. You'll need to check whether your Slurm setup has core-level resource handling.
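With Slurm, one way to try this is to submit each simulation as its own job that requests only 64 cores and does not ask for the whole node. Whether two such jobs actually land on the same node depends on the cluster configuration (for example a consumable-resource SelectType and the partition's sharing policy), so treat this as a sketch under those assumptions. The problem names case_a and case_b, the helper make_job_script and the SUBMIT switch are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a batch script that asks for $2 cores on one
# node for problem $1. No whole-node request, so the node can be shared.
make_job_script() {
    local pb="$1" cores="$2"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=${pb}
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${cores}
acuRun -np ${cores} -pb ${pb} -libs ./
EOF
}

# Write one script per simulation; submit only when SUBMIT=1 is set.
for pb in case_a case_b; do
    make_job_script "$pb" 64 > "job_${pb}.sh"
    if [ "${SUBMIT:-0}" = 1 ]; then
        sbatch "job_${pb}.sh"
    else
        echo "wrote job_${pb}.sh (set SUBMIT=1 to submit)"
    fi
done
```

If the second job stays pending while 64 cores are idle, the node is probably being allocated as a whole, which is a question for the cluster administrators rather than for AcuSolve.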
Thank you, I will try again.
Hi acupro,
I tried running jobs in parallel from the same directory on my local machine, and it works as expected. On the cluster, however, it does not work.
I tried submitting the jobs in parallel on the cluster.
The submission script is:
#!/bin/bash
#SBATCH --job-name="acupro_concurrent_test"
#SBATCH -o acupro_concurrent_test_%j.out
#SBATCH --nodes=1
#SBATCH --cpus-per-task=36
#SBATCH --exclusive

# Launch the three cases as background job steps, then wait for all of them
for i in {1..3}; do
    srun acuRun -np 12 -pb acupro_$i -libs ./ &
done
wait
I get the following error:
acuRun: Log is redirected to: acupro_2.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:34 2023
srun: error: cn333: task 0: Exited with exit code 1
srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1910962
acuRun: Log is redirected to: acupro_3.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:42 2023
srun: error: cn333: task 0: Exited with exit code 1
srun: Job 1910962 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1910962
acuRun: Log is redirected to: acupro_1.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:51 2023
srun: error: cn333: task 0: Exited with exit code 1
Changing --cpus-per-task, --ntasks, etc. does not help; the jobs still do not run.
Do you have any idea what is going wrong?
Thanks and regards,
Thank you, it is working now (although I had to make some modifications).
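The thread does not say which modifications were needed. For reference, one possible shape for running several cases as concurrent job steps inside a single allocation is sketched below. It assumes input files named input_<i>.inp, and it assumes a Slurm version where srun accepts --exact, so that each step takes only the CPUs it asks for; that may avoid the "Requested nodes are busy" retries seen above. Whether acuRun's own process launch cooperates with this layout is not confirmed by the thread. The function name launch_case and the DRY_RUN switch are hypothetical.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=12

# DRY_RUN=1 (the default here) only prints the srun lines; set DRY_RUN=0
# inside a real allocation to launch them. Hypothetical switch.
DRY_RUN="${DRY_RUN:-1}"

launch_case() {
    local i="$1"
    # One task with 12 CPUs per step; --exact asks Slurm to give the step
    # only those CPUs so the other steps can start alongside it.
    local cmd="srun --exact --ntasks=1 --cpus-per-task=12 acuRun -np 12 -pb acupro_${i} -inp input_${i}.inp -libs ./"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd &
    fi
}

for i in 1 2 3; do
    launch_case "$i"
done
wait   # keep the batch script alive until all background steps finish
```

Printing the commands first makes it easy to confirm that each problem name is paired with the intended input file before any cores are requested.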