I am trying to run a parallel OptiStruct simulation on a Linux cluster:
28-way Haswell-EP nodes with Infiniband FDR14 interconnect and 2 hardware threads per physical core
I want to run the simulation on 3 nodes with 28 cores each. Our system administrator uses SLURM, so jobs are submitted with sbatch. I therefore wrote the following sbatch script:
#!/bin/bash
#SBATCH -J Phantom
#SBATCH -o ./%x.%j.%N.out
#SBATCH -D ./
#SBATCH --get-user-env
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --qos=cm2_std
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=28
#SBATCH --mail-type=end
#SBATCH --export=NONE
#SBATCH --time=8:00:00
module load slurm_setup
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/2018/altair/hwsolvers/common/bin/linux64/
mpiexec -n $SLURM_NTASKS ~/2020/altair/hwsolvers/optistruct/bin/linux64/optistruct_2020.1_linux64_impi ./Phantom_v2.fem -ddmmode
This starts the OptiStruct solver, but the run then aborts with the following error message:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 28 PID 51302 RUNNING AT i22r04c03s07
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
followed by further output of the same kind.
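For anyone reading the banner above, here is a minimal sketch of how its two numbers could be interpreted. The helper names `signal_name` and `node_index_for_rank` are hypothetical, and the rank-to-node mapping assumes block placement (ranks filled node by node), which the log does not confirm. Signal 9 is SIGKILL, which a process cannot catch. One common cause on clusters is a process exceeding the available memory, but that is only a guess here. The optional `sacct` call assumes job accounting is enabled on the cluster and that `JOBID` has been set by hand.

```shell
#!/bin/bash
# Hypothetical helpers for reading the MPI termination banner.
# They are not part of OptiStruct, Intel MPI, or SLURM.

# Translate a signal number into its name using the shell's kill builtin.
signal_name() {
    kill -l "$1"
}

# 0-based index of the node hosting a given rank, ASSUMING block placement:
# ranks 0..(tasks_per_node-1) on the first node, the next batch on the second, etc.
node_index_for_rank() {
    local rank=$1 tasks_per_node=$2
    echo $(( rank / tasks_per_node ))
}

echo "signal 9 is SIG$(signal_name 9)"
echo "rank 28 would sit on node index $(node_index_for_rank 28 28) of the allocation"

# Optional: show memory accounting for a finished job, if sacct is installed.
# JOBID is a placeholder; cm2 is the cluster name taken from the sbatch script.
if command -v sacct >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
    sacct -M cm2 -j "$JOBID" --format=JobID,State,ExitCode,MaxRSS,NodeList
fi
```

If the block placement assumption holds, rank 28 is the first rank on the second node, so the failure may be tied to starting or running processes on a remote node rather than to the first node alone.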
Does anybody know what is causing this error, or which problems could lead to it?
Many thanks in advance
Christian