Parallel jobs in HPC
Hi all,
I have a compute node with 128 cores. I want to run two AcuSolve simulations with 64 cores each on a single node. I use Slurm for job submission. The command I use is:
srun -n 1 -N 1 -c 64 acuRun -np 64 -libs ./libusr.so
By changing the -n and -N flags above, I was not able to run two different AcuSolve simulations on the cluster. Only 64 cores were used and the rest sat idle. I was also not able to submit a second job to a compute node once a job was already running on that node.
Is this coming from the AcuSolve side or from the cluster?
Do you have any idea how to run multiple AcuSolve simulations on a single compute node?
Thanks.
Best Answer
-
By default, the solver looks for the input file to match the problem name. So if you indicate
-pb acupro_1
it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file
acuPrep: *** ERROR: Error opening file <acupro_1.inp>
acuPrep: *** ERROR: No RUN command found
and the job dies.
You can add the flag to look for the appropriate/desired input file as well, for example:
-pb acupro_1 -inp input_1.inp
Or however you want to point to the correct input file with your script.
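For example, the loop in your submission script could become something like this - a minimal sketch, assuming the input files are named input_1.inp through input_3.inp (placeholder names; point to whatever your actual files are called):
#####################################
for i in {1..3}
do
  # -pb sets the problem name (used for Log/output file naming);
  # -inp points the solver at the actual input file, so the two
  # names no longer need to match.
  srun acuRun -np 12 -pb acupro_$i -inp input_$i.inp -libs ./libusr.so &
done
wait
#####################################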
Answers
-
I've just tested running two jobs on our internal machine, and Altair PBS Pro does appear to have core-level resource handling, as it submitted two jobs to the same compute node. AcuSolve itself does not have that ability - you'll need to rely on your job control software.
I would assume that Slurm controls which resources are assigned to the job, so this would fall to Slurm. You'll need to check whether your Slurm configuration supports core-level resource handling.
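If it helps, here's a minimal sketch of one way to run two 64-core solves under a single Slurm allocation - assuming a Slurm version (21.08 or newer) where srun supports --exact, so each job step only claims the cores it requests; the case_1/case_2 names are placeholders:
#####################################
#!/bin/bash
#SBATCH --job-name=acusolve_pair
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=64

# Launch two 64-core AcuSolve runs as concurrent job steps on one node.
# --exact restricts each step to exactly the resources it asks for,
# so the second step doesn't sit waiting for the whole node.
srun --exact -n 1 -c 64 acuRun -np 64 -pb case_1 -inp case_1.inp -libs ./libusr.so &
srun --exact -n 1 -c 64 acuRun -np 64 -pb case_2 -inp case_2.inp -libs ./libusr.so &
wait
#####################################
Whether two separate sbatch jobs can share one node instead depends on the partition configuration (node sharing has to be allowed), so that part is worth checking with your cluster admins.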
-
acupro_21778 said:
I've just tested running two jobs on our internal machine, and Altair PBS Pro does appear to have core-level resource handling, as it submitted two jobs to the same compute node. AcuSolve itself does not have that ability - you'll need to rely on your job control software.
I would assume that Slurm controls which resources are assigned to the job, so this would fall to Slurm. You'll need to check whether your Slurm configuration supports core-level resource handling.
Thank you, I will try again.
-
acupro_21778 said:
I've just tested running two jobs on our internal machine, and Altair PBS Pro does appear to have core-level resource handling, as it submitted two jobs to the same compute node. AcuSolve itself does not have that ability - you'll need to rely on your job control software.
I would assume that Slurm controls which resources are assigned to the job, so this would fall to Slurm. You'll need to check whether your Slurm configuration supports core-level resource handling.
Hi acupro,
I tried running jobs in parallel from the same directory on my local machine, and it works as expected. But when I moved to the cluster, it did not work.
Here is the submission script I used to submit the jobs in parallel on the cluster:
#####################################
#!/bin/bash
#SBATCH --job-name="acupro_concurrent_test"
#SBATCH -o acupro_concurrent_test_%j.out
#SBATCH --nodes=1
#SBATCH --cpus-per-task=36
#SBATCH --exclusive

for i in {1..3}
do
  srun acuRun -np 12 -pb acupro_$i -libs ./libusr.so &
done
wait
#####################################
I get the following error:
#####################################
acuRun: Log is redirected to: acupro_2.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:34 2023
srun: error: cn333: task 0: Exited with exit code 1
srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1910962 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1910962
acuRun: Log is redirected to: acupro_3.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:42 2023
srun: error: cn333: task 0: Exited with exit code 1
srun: Job 1910962 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1910962
acuRun: Log is redirected to: acupro_1.1.Log
acuRun: *** ERROR: error occurred executing acuPrep
acuRun: Thu Mar 30 15:58:51 2023
srun: error: cn333: task 0: Exited with exit code 1
#####################################
Changing --cpus-per-task, --ntasks, etc. does not help; the jobs still do not run.
Do you have any idea what is going wrong here?
Thanks and regards,
-
acupro_21778 said:
By default, the solver looks for the input file to match the problem name. So if you indicate
-pb acupro_1
it will look for acupro_1.inp as the input file - and when that is not found, you see that error in the Log file
acuPrep: *** ERROR: Error opening file <acupro_1.inp>
acuPrep: *** ERROR: No RUN command found
and the job dies.
You can add the flag to look for the appropriate/desired input file as well, for example:
-pb acupro_1 -inp input_1.inp
Or however you want to point to the correct input file with your script.
Thank you, it is working now (although I had to make some modifications).