Feko run fails with floating point exception
I am trying to run the Feko binary as shown below in an HPC cluster setup.
/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko /mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek -np $num_cores --machines-file machinefile -d --mpi-options $MPI_OPTIONS
-genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv FI_PROVIDER=mlx -genv USE_UCX=1 -genv UCX_MAX_RNDV_RAILS=1
Feko caught signal 8 (PID 3052949)
Memory location which caused fault: 0x3f4002e9595
Floating point exception: Unknown exception with subcode=-6
Feko caught signal 8 (PID 3053073)
Memory location which caused fault: 0x3f4002e9611
Floating point exception: Unknown exception with subcode=-6
The following message from the master process (MYID= 0):
ERROR 3977: Internal Feko error. Please notify the Feko support team and provide the error number, preferably together with the Feko input and output files.
feko_parallel(debug): Exiting with return code 2 (0, 2, 0)
RUNFEKO(debug): Forked child process "feko_parallel" with pid = 3052771
ERROR 20011:
Error when executing the program /mnt/share/codes/feko/2022.2/altair/feko/bin/feko_parallel
with the options " 256 /mnt/share/benchmarks/feko/generic_sedan_parametric_1000 --machines-file machinefile -genv"
(error codes: 2 ; 0 [Success])
See above error message of the program for more details!
RUNFEKO(debug): Error while executing feko_parallel
Answers
Hi Krishna,
I'm not an expert in HPC installations, but maybe the problem is insufficient memory. Does it also occur with other (small) models?
Best regards,
Torben
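To check whether memory is a plausible cause, it may help to look at how much memory is available on a compute node just before the run. The following is a minimal sketch, assuming a Linux node that exposes `MemAvailable` in `/proc/meminfo`. The function name `mem_available_gib` and its optional file argument are hypothetical helpers introduced here, not part of Feko.

```shell
#!/usr/bin/env bash
# Report available memory in GiB from a meminfo-style file.
# Defaults to /proc/meminfo (Linux). The optional path argument is an
# assumption added so the function can also be tried on a saved copy.
mem_available_gib() {
    local file="${1:-/proc/meminfo}"
    # /proc/meminfo reports values in kB; 1048576 kB = 1 GiB.
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }' "$file"
}

echo "MemAvailable on $(hostname): $(mem_available_gib) GiB"
```

Running this on each host listed in the machines file would show whether the 256-core node has less memory per process than the 128- and 192-core machines.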
Torben Voigt_20420 said:
Hi Krishna,
I'm not an expert in HPC installations, but maybe the problem is insufficient memory. Does it also occur with other (small) models?
Best regards,
Torben
Frankly, I don't have any other small models to try on our machine. I assumed the failure I am observing is due to a misconfiguration or a missing parameter, since the same Feko binary runs fine on other machines in our HPC cluster.
Hi Krishna,
How many parallel cores do you use? If it is a large MLFMM simulation, you could try with fewer cores to reduce the memory requirement a bit.
(Just to test whether this may be memory related.)
Best regards,
Torben
Torben Voigt_20420 said:
Hi Krishna,
How many parallel cores do you use? If it is a large MLFMM simulation, you could try with fewer cores to reduce the memory requirement a bit.
(Just to test whether this may be memory related.)
Best regards,
Torben
I have run this successfully on 128-core and 192-core machines, and I am now trying to run it on a 256-core machine with around 1 GB of L3 cache. Are you suggesting that the run could fail even with such a configuration?
Krishna_21416 said:
I have run this successfully on 128-core and 192-core machines, and I am now trying to run it on a 256-core machine with around 1 GB of L3 cache. Are you suggesting that the run could fail even with such a configuration?
Well, I have no idea about the model or which solver is used. If MLFMM is used, then you will probably see an increase in memory with more cores (256 is a lot!). Why not just try with 32 cores to see if it works?
Could you attach the model, maybe?
Best regards,
Torben
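The suggestion to retry with fewer cores can be sketched as a small sweep over process counts. This is only an illustration: the `runfeko` path, the model path, `-np` and `--machines-file machinefile` are copied from the original post, while the core counts, the `DRY_RUN` switch and the function names `feko_cmd` and `sweep` are assumptions. The `-d` and `--mpi-options` arguments from the original command are left out for brevity and would need to be appended for a real run.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the runfeko command for a series of core counts.
# Paths and flags are taken from the original post; everything else is
# an assumption for illustration.

RUNFEKO=/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko
MODEL=/mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek

# Build the command line for a given number of processes.
feko_cmd() {
    local np="$1"
    printf '%s %s -np %s --machines-file machinefile\n' "$RUNFEKO" "$MODEL" "$np"
}

# With DRY_RUN=1 (the default) only print each command;
# with DRY_RUN=0 actually launch the runs one after another.
sweep() {
    local np
    for np in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            feko_cmd "$np"
        else
            echo "=== -np $np ==="
            "$RUNFEKO" "$MODEL" -np "$np" --machines-file machinefile \
                || echo "run with -np $np failed"
        fi
    done
}

sweep 32 64 128 256
```

If the smaller counts complete and only the 256-process run fails, that would be consistent with the memory explanation; if even 32 processes fail on this machine, a configuration difference between the nodes looks more likely.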