Feko run fails with floating point exception

Krishna_21416
Krishna_21416 New Altair Community Member
edited April 29 in Community Q&A

I am trying to run the Feko binary as shown below in an HPC cluster setup.

/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko /mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek -np $num_cores --machines-file machinefile -d --mpi-options $MPI_OPTIONS
 
where MPI_OPTIONS is set to -genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv FI_PROVIDER=mlx -genv USE_UCX=1 -genv UCX_MAX_RNDV_RAILS=1
 
While this runs fine on a few clusters, on one specific cluster the run fails with the error below.
 
 Feko caught signal 8 (PID 3052949)
  Memory location which caused fault: 0x3f4002e9595
 Floating point exception: Unknown exception with subcode=-6
 Feko caught signal 8 (PID 3053073)
  Memory location which caused fault: 0x3f4002e9611
 Floating point exception: Unknown exception with subcode=-6
 The following message from the master process (MYID= 0):
 ERROR    3977: Internal Feko error. Please notify the Feko support team and provide the error number, preferably together with the Feko input and output files.
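
A possible reading of the two "Memory location" values above, offered as a hypothesis rather than a diagnosis: on Linux, a signal subcode (si_code) of -6 corresponds to SI_TKILL, meaning the signal was sent with tkill/tgkill (which is what raise() uses) rather than triggered by an arithmetic fault. For such signals, the siginfo field that normally holds the fault address overlaps the sender's PID and UID. The sketch below assumes the x86_64 Linux layout (PID in the low 32 bits, UID in the high 32 bits); the variable names are illustrative only.

```shell
#!/bin/sh
# Value Feko printed as "Memory location which caused fault".
addr=0x3f4002e9595

# Assumption: Linux x86_64 siginfo_t layout, where the address field
# overlaps si_pid (low 32 bits) and si_uid (high 32 bits).
sender_pid=$(( $addr & 0xffffffff ))
sender_uid=$(( $addr >> 32 ))

echo "sender pid: $sender_pid"
echo "sender uid: $sender_uid"
```

The low half decodes to 3052949, the same PID Feko printed for that process, which is consistent with the process raising the signal on itself (for example from an error handler in Feko or in one of the libraries it loads) and not with a division by zero at that address. Which component raised it cannot be told from this log alone.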
 
and 
 
feko_parallel(debug): Exiting with return code 2 (0, 2, 0)
RUNFEKO(debug): Forked child process "feko_parallel" with pid = 3052771

ERROR  20011:

  Error when executing the program /mnt/share/codes/feko/2022.2/altair/feko/bin/feko_parallel
  with the options " 256 /mnt/share/benchmarks/feko/generic_sedan_parametric_1000 --machines-file machinefile -genv"
  (error codes: 2 ; 0 [Success])
  See above error message of the program for more details!
RUNFEKO(debug): Error while executing feko_parallel
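
One detail in the log above: the options line passed to feko_parallel ends in a lone "-genv", so most of MPI_OPTIONS does not seem to have arrived as MPI options. A likely explanation is shell word splitting, because $MPI_OPTIONS is unquoted in the runfeko command. The sketch below only demonstrates the splitting; count_args is a hypothetical helper, and whether runfeko expects the options as one quoted argument is an assumption to check against the Feko documentation.

```shell
#!/bin/sh
MPI_OPTIONS="-genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv FI_PROVIDER=mlx -genv USE_UCX=1 -genv UCX_MAX_RNDV_RAILS=1"

# Hypothetical helper: prints how many separate arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the value on whitespace into many arguments,
# so only the first word ("-genv") follows --mpi-options.
unquoted=$(count_args $MPI_OPTIONS)

# Quoted: the whole string is passed as a single argument.
quoted=$(count_args "$MPI_OPTIONS")

echo "unquoted: $unquoted argument(s)"
echo "quoted:   $quoted argument(s)"
```

If that is what happens here, writing --mpi-options "$MPI_OPTIONS" would pass the full string. This may not be the cause of the crash, since the same command reportedly runs on other clusters, but it would mean the FI_PROVIDER and UCX settings are not being applied on any of them.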
 
What could be the possible reasons for this failure, and what would be required to fix it?

Answers

  • Torben Voigt
    Torben Voigt Altair Community Member
    edited April 29

    Hi Krishna,

    I'm not an expert in HPC installations, but maybe the problem is insufficient memory. Does the problem also occur with other (small) models?

    Best regards,
    Torben

  • Krishna_21416
    Krishna_21416 New Altair Community Member
    edited April 29

    Frankly, I don't have other small models to try running Feko with on our machine. I was assuming that the failure I am observing is due to some misconfiguration or a missing parameter, since the same Feko binary runs fine on other machines in our HPC cluster.

  • Torben Voigt
    Torben Voigt Altair Community Member
    edited April 29

    Hi Krishna,

    How many parallel cores do you use? If it is a large MLFMM simulation, you could try with fewer cores to reduce the memory requirement a bit.

    (Just to test if this may be memory related)

    Best regards,
    Torben

  • Krishna_21416
    Krishna_21416 New Altair Community Member
    edited April 29

    I have successfully run the same model on 128-core and 192-core machines, and now I am trying to run it on a 256-core machine with around 1 GB of L3 cache. Do you suggest that the run could fail even with such a configuration?

  • Torben Voigt
    Torben Voigt Altair Community Member
    edited April 29

    Well, I have no idea about the model or which solver is used. If MLFMM is used, then you will probably see memory use increase with more cores (256 is a lot!). Why not just try with 32 cores to see if it works?

    Could you attach the model maybe?

    Best regards,
    Torben
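
The suggestion to retry with fewer cores could be scripted roughly as follows. This is a sketch, not a tested procedure: it reuses only the paths and runfeko flags from the original command, leaves out the MPI options for brevity, and adds a hypothetical DRY_RUN switch and run_feko helper so the commands can be inspected before anything is launched. The core counts are arbitrary, and the machinefile must provide enough slots for each.

```shell
#!/bin/sh
RUNFEKO=/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko
MODEL=/mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek
DRY_RUN=${DRY_RUN:-1}   # hypothetical switch: 1 = only print the commands

# Hypothetical helper: build the runfeko command for a given core count,
# then either print it (dry run) or execute it.
run_feko() {
    np=$1
    set -- "$RUNFEKO" "$MODEL" -np "$np" --machines-file machinefile -d
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start small and work upward; stop at the first core count that fails.
for np in 32 64 128 256; do
    run_feko "$np" || { echo "run failed at -np $np" >&2; break; }
done
```

Running it with DRY_RUN=0 launches the jobs. If 32 cores succeeds and a larger count fails, that points toward a resource or scaling issue on that cluster; if 32 cores fails too, a configuration difference becomes the more likely suspect.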