Problems with Radioss on an HPC Cluster
Hello
I have just started using Radioss 2017 on an HPC cluster. For the SLURM file I use syntax like this:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --job-name=altair-radioss
#SBATCH --mem=1G

# Set up input file(s) for your case
INPUT1=v14_run_0000.rad
INPUT2=v14_run_0001.rad

# Set up a custom library path if necessary (double quotes so the variables expand)
export LD_LIBRARY_PATH="$HOME/my_user_library:$LD_LIBRARY_PATH"

# Load the Altair module
module load altair/2017.2

if [[ ($SLURM_JOB_NUM_NODES == 1) && ($SLURM_NTASKS_PER_NODE == 1) ]]; then
    # Run Radioss on a single core
    radioss "$INPUT1"
elif [[ $SLURM_JOB_NUM_NODES == 1 ]]; then
    # Run Radioss on many cores of one node
    radioss -nthread $SLURM_NTASKS_PER_NODE "$INPUT1"
else
    # Run Radioss on many cores of many nodes
    module load intel/2013/intel-mpi
    $ALTAIR_HOME/hwsolvers/radioss/bin/linux64/s_2017.2_linux64 -i "$INPUT1" -nt 1 -np $SLURM_JOB_NUM_NODES
    mpiexec $ALTAIR_HOME/hwsolvers/radioss/bin/linux64/e_2017.2_linux64_impi -i "$INPUT2" -nt 1
fi
It worked. But when I changed
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
It did not work.
Could you please explain why?
Thank you
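For reference, a minimal sketch of the echo lines I can add near the top of the job script to see which SLURM variables the script branches on (the variable names are the same ones used in the if/elif above):
# Print the SLURM variables the branching logic depends on
echo "Nodes:          $SLURM_JOB_NUM_NODES"
echo "Tasks per node: $SLURM_NTASKS_PER_NODE"
echo "Total tasks:    $SLURM_NTASKS"
echo "Node list:      $SLURM_JOB_NODELIST"
if [[ $SLURM_JOB_NUM_NODES -gt 1 ]]; then
    echo "Multi-node branch: starter + mpiexec will be used"
fi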
Answers
-
Hi,
This could be due to system limitations, as you are trying to invoke more CPUs.
I recommend you contact the Slurm support team for a clearer explanation.
-
Thank you, George.
Could you please give me the contact details of the Slurm team?
-
Sorry Pohan, Slurm is not an Altair product. Please look for their support resources online.
-
Hi Pohan,
I would suggest you verify that your radioss mpi submission command is working correctly outside of Slurm.
First, I would recommend using the Intel MPI that ships with RADIOSS instead of another one that is installed on your system. It is located in,
altair/hw/2017/altair/mpi/linux64/intel-mpi/bin/
I think this line needs '-n $SLURM_JOB_NUM_NODES', like this:
mpiexec -n $SLURM_JOB_NUM_NODES $ALTAIR_HOME/hwsolvers/radioss/bin/linux64/e_2017.2_linux64_impi -i "$INPUT2" -nt 1
Please download the HyperWorks Advanced Installation Guide PDF from Altair Connect for the other environment variables suggested for running RADIOSS with the Intel MPI.
Alternatively, you can use the script located in scripts/radioss that ships with RADIOSS. Then you don't need to call mpiexec; instead, the command would look something like this, where -np is the number of MPI domains:
/altair/hw/2017/altair/scripts/radioss -v 2017.2.1 modelinput_0000.rad -mpi i -np 48 -nt 1 -hostfile /var/spool/PBS/aux/50494.admin -mpiargs -genv KMP_AFFINITY=scatter -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_ADJUST_BCAST=1 -genv I_MPI_ADJUST_REDUCE=2 -genv I_MPI_MPIRUN_CLEANUP=1 -genv KMP_STACKSIZE=400m -genv I_MPI_FABRICS=shm:dapl -noh3d
The documentation for this radioss script is in the RADIOSS help under RADIOSS, User Guide, Run Options.
-
Altair Forum User said:
Hi Pohan,
I would suggest you verify that your radioss mpi submission command is working correctly outside of Slurm.
First, I would recommend using the Intel MPI that ships with RADIOSS instead of another one that is installed on your system. It is located in,
altair/hw/2017/altair/mpi/linux64/intel-mpi/bin/
I think this line needs '-n $SLURM_JOB_NUM_NODES', like this:
mpiexec -n $SLURM_JOB_NUM_NODES $ALTAIR_HOME/hwsolvers/radioss/bin/linux64/e_2017.2_linux64_impi -i "$INPUT2" -nt 1
Please download the HyperWorks Advanced Installation Guide PDF from Altair Connect for the other environment variables suggested for running RADIOSS with the Intel MPI.
Alternatively, you can use the script located in scripts/radioss that ships with RADIOSS. Then you don't need to call mpiexec; instead, the command would look something like this, where -np is the number of MPI domains:
/altair/hw/2017/altair/scripts/radioss -v 2017.2.1 modelinput_0000.rad -mpi i -np 48 -nt 1 -hostfile /var/spool/PBS/aux/50494.admin -mpiargs -genv KMP_AFFINITY=scatter -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_ADJUST_BCAST=1 -genv I_MPI_ADJUST_REDUCE=2 -genv I_MPI_MPIRUN_CLEANUP=1 -genv KMP_STACKSIZE=400m -genv I_MPI_FABRICS=shm:dapl -noh3d
The documentation for this radioss script is in the RADIOSS help under RADIOSS, User Guide, Run Options.
Hello, I tried to modify the SLURM file following your suggestion, but again it does not work. The error is below:
/share/applications/altair/2017.2/altair/hwsolvers/radioss/bin/linux64/e_2017.2_linux64_impi: error while loading shared libraries: libmpi.so.12: cannot open shared object file: No such file or directory
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 108268 RUNNING AT styx-06-17
= EXIT CODE: 127
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
-
Altair Forum User said:
Hello, I tried to modify the SLURM file following your suggestion, but again it does not work. The error is below:
/share/applications/altair/2017.2/altair/hwsolvers/radioss/bin/linux64/e_2017.2_linux64_impi: error while loading shared libraries: libmpi.so.12: cannot open shared object file: No such file or directory
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 108268 RUNNING AT styx-06-17
= EXIT CODE: 127
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
You first need to verify that you can run RADIOSS outside of Slurm.
Since you are running the executable directly, I would assume this is caused by an environment variable not being set correctly. The HyperWorks Advanced Installation Guide PDF from Altair Connect describes the other environment variables suggested for running RADIOSS with the Intel MPI.
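The 'libmpi.so.12: cannot open shared object file' message means the Intel MPI runtime libraries are not on LD_LIBRARY_PATH for the engine executable. As a minimal sketch of setting that by hand, assuming the Intel MPI bundled with the installation in the error message keeps its libraries in a lib directory next to bin (the IMPI_ROOT path below is a guess based on the paths in this thread; check the actual layout on your cluster):
# Hypothetical path; adjust to the real HyperWorks installation root on your cluster
IMPI_ROOT=/share/applications/altair/2017.2/altair/mpi/linux64/intel-mpi
export PATH="$IMPI_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$IMPI_ROOT/lib:$LD_LIBRARY_PATH"
With something like that exported before mpiexec is called, the e_2017.2_linux64_impi engine should be able to resolve libmpi.so.12 from the bundled runtime instead of relying on the intel/2013/intel-mpi module.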
But I would recommend you just use the radioss script that is included with HyperWorks, since it sets all those environment variables for you. First, try running on one node without Slurm using:
/altair/hw/2017/altair/scripts/radioss -v 2017.2.1 modelinput_0000.rad -mpi i -np 8 -nt 1 -mpiargs -genv KMP_AFFINITY=scatter -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_ADJUST_BCAST=1 -genv I_MPI_ADJUST_REDUCE=2 -genv I_MPI_MPIRUN_CLEANUP=1 -genv KMP_STACKSIZE=400m -genv I_MPI_FABRICS=shm:dapl
Then make a host file with the format node:#cores, like this:
more hostfile
node1:16
node2:16
Then try on two nodes using:
/altair/hw/2017/altair/scripts/radioss -v 2017.2.1 modelinput_0000.rad -mpi i -np 48 -nt 1 -hostfile hostfile -mpiargs -genv KMP_AFFINITY=scatter -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_ADJUST_BCAST=1 -genv I_MPI_ADJUST_REDUCE=2 -genv I_MPI_MPIRUN_CLEANUP=1 -genv KMP_STACKSIZE=400m -genv I_MPI_FABRICS=shm:dapl
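Since this thread submits through Slurm rather than PBS, a minimal sketch of building that node:#cores hostfile inside the job script, assuming a homogeneous allocation (scontrol show hostnames expands $SLURM_JOB_NODELIST to one node name per line; the 16-core fallback matches the example above):
# Build a node:#cores hostfile from the Slurm allocation
CORES_PER_NODE=${SLURM_NTASKS_PER_NODE:-16}
scontrol show hostnames "$SLURM_JOB_NODELIST" | awk -v c="$CORES_PER_NODE" '{print $1 ":" c}' > hostfile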
-
Thank you, Andy.
Now it works with one node and many cores, but with many nodes and many cores it does not work.
I also found this syntax for an Intel cluster:
[radioss@host1~]$ cp $ALTAIR_HOME/hwsolvers/common/bin/linux64/radflex_2017_linux64
[radioss@host1~]$ $ALTAIR_HOME/hwsolvers/radioss/bin/linux64/s_2017_linux64 -input [ROOTNAME]_0000.rad -np [Nspmd]
[radioss@host1~]$ [Intel MPI path]/bin/mpirun -configfile [cgfile]
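For context, [cgfile] here is an MPI configuration file in which each line describes one set of mpirun arguments. A minimal sketch of what it might contain for two nodes with 8 MPI domains each (the node names, domain counts, and [ALTAIR_HOME] placeholder are illustrative assumptions; the exact format is described in the RADIOSS documentation):
-n 8 -host node1 [ALTAIR_HOME]/hwsolvers/radioss/bin/linux64/e_2017_linux64_impi -i [ROOTNAME]_0001.rad -nt 1
-n 8 -host node2 [ALTAIR_HOME]/hwsolvers/radioss/bin/linux64/e_2017_linux64_impi -i [ROOTNAME]_0001.rad -nt 1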
Could you tell me more about [Intel MPI path]? Because I use an HPC cluster, I do not know the path of the Intel MPI.
-
Hi,
It would be much easier if you used the script that comes with HyperWorks to launch RADIOSS. Please try the script; then you don't need to do anything about the Intel MPI path, as the script sets that for you.
/altair/hw/2017/altair/scripts/radioss -v 2017.2.1 modelinput_0000.rad -mpi i -np 48 -nt 1 -hostfile hostfile -mpiargs -genv KMP_AFFINITY=scatter -genv I_MPI_PIN_DOMAIN=auto -genv I_MPI_ADJUST_BCAST=1 -genv I_MPI_ADJUST_REDUCE=2 -genv I_MPI_MPIRUN_CLEANUP=1 -genv KMP_STACKSIZE=400m -genv I_MPI_FABRICS=shm:dapl
with a hostfile that contains:
more hostfile
node1:16
node2:16
Please send the output for this command.
A few more things:
Can you ssh between the nodes without entering a password?
Last, the HyperWorks installation should be accessible to all nodes in the same location, either by installing it on a shared drive or locally in the same path on each node.
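Regarding the passwordless ssh question above, a minimal sketch of the usual key-based setup, assuming a home directory shared across the nodes (check with your cluster administrators before changing anything):
# Generate a key pair without a passphrase (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize the key; with a shared home directory this covers every node
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Test: this should log in without prompting for a password (node2 as in the hostfile above)
ssh node2 hostname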