Feko run fails with floating point exception

Krishna_21416 Altair Community Member
edited April 2024 in Community Q&A

I am trying to run the Feko binary on an HPC cluster, as shown below.

/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko /mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek -np $num_cores --machines-file machinefile -d --mpi-options $MPI_OPTIONS
 
where MPI_OPTIONS is set to -genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv FI_PROVIDER=mlx -genv USE_UCX=1 -genv UCX_MAX_RNDV_RAILS=1
 
While this runs fine on a few clusters, on one specific cluster the run fails with the error below.
 
 Feko caught signal 8 (PID 3052949)
  Memory location which caused fault: 0x3f4002e9595
 Floating point exception: Unknown exception with subcode=-6
 Feko caught signal 8 (PID 3053073)
  Memory location which caused fault: 0x3f4002e9611
 Floating point exception: Unknown exception with subcode=-6
 The following message from the master process (MYID= 0):
 ERROR    3977: Internal Feko error. Please notify the Feko support team and provide the error number, preferably together with the Feko input and output files.
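
A note on reading these lines, offered as a hedged interpretation rather than a diagnosis: on Linux, signal 8 is SIGFPE, and a subcode (si_code) of -6 corresponds to SI_TKILL, which means the signal was sent with tkill/tgkill (as raise() does) rather than generated by a faulting arithmetic instruction. In that case the siginfo field printed here as a "memory location" overlaps the sender's PID and UID. The sketch below assumes a 64-bit little-endian layout and simply splits the two reported values; the function name is illustrative.

```shell
#!/bin/sh
# Split a value reported as "memory location" into its low and high 32 bits.
# Assumption: with si_code = SI_TKILL the field actually holds si_pid (low half)
# and si_uid (high half) on 64-bit little-endian Linux.
decode_fault() {
    pid=$(( $1 & 0xffffffff ))
    uid=$(( $1 >> 32 ))
    echo "pid=$pid uid=$uid"
}

first=$(decode_fault 0x3f4002e9595)
second=$(decode_fault 0x3f4002e9611)
echo "$first"    # pid=3052949 uid=1012
echo "$second"   # pid=3053073 uid=1012
```

Both low halves equal the PIDs in the "caught signal 8" lines (3052949 and 3053073). That would be consistent with each process signalling itself, possibly from an error path in a runtime or communication library, so an arithmetic bug in the solver is not necessarily the cause. Comparing the MPI, UCX and libfabric setup of this cluster with the working ones may be a useful next step.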
 
The output also contains:
 
feko_parallel(debug): Exiting with return code 2 (0, 2, 0)
RUNFEKO(debug): Forked child process "feko_parallel" with pid = 3052771

ERROR  20011:

  Error when executing the program /mnt/share/codes/feko/2022.2/altair/feko/bin/feko_parallel
  with the options " 256 /mnt/share/benchmarks/feko/generic_sedan_parametric_1000 --machines-file machinefile -genv"
  (error codes: 2 ; 0 [Success])
  See above error message of the program for more details!
RUNFEKO(debug): Error while executing feko_parallel
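
One detail in the ERROR 20011 message may be worth checking: the options passed to feko_parallel end in a lone "-genv", while MPI_OPTIONS holds five -genv pairs. That would be consistent with the shell splitting the unquoted $MPI_OPTIONS into separate words, so that only the first word follows --mpi-options. Whether runfeko expects the MPI options as a single quoted string is an assumption here and should be confirmed in the Feko documentation. The sketch below only demonstrates the shell behaviour; count_args is a hypothetical helper.

```shell
#!/bin/sh
# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

MPI_OPTIONS="-genv I_MPI_DEBUG=5 -genv I_MPI_PIN=1 -genv FI_PROVIDER=mlx -genv USE_UCX=1 -genv UCX_MAX_RNDV_RAILS=1"

# Unquoted: sh/bash split the value on whitespace into separate arguments.
unquoted_count=$(count_args $MPI_OPTIONS)
# Quoted: the whole string is passed as one argument.
quoted_count=$(count_args "$MPI_OPTIONS")

echo "unquoted: $unquoted_count arguments"
echo "quoted:   $quoted_count argument"
```

If a single string is indeed expected, writing --mpi-options "$MPI_OPTIONS" in the runfeko command would keep the options together. This may or may not be related to the signal 8 failure, since the same command reportedly works on the other clusters.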
 
What could be the possible reasons for this failure, and what would be required to fix it?

Answers

  • Torben Voigt
    Altair Employee
    edited April 2024

    Hi Krishna,

    I'm not an expert in HPC installations, but maybe the problem is insufficient memory. Does the problem also occur with other (smaller) models?

    Best regards,
    Torben
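
    A rough way to check the memory hypothesis on a Linux compute node is to read MemAvailable from /proc/meminfo and divide it by the number of processes planned for that node. The sketch below is only an illustration: the function names are hypothetical, and the sample figures are made up so that the example runs anywhere.

```shell
#!/bin/sh
# Print MemAvailable in whole GiB from a meminfo-style file
# (defaults to /proc/meminfo on Linux).
mem_available_gib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Rough memory per process: available GiB divided by processes per node.
gib_per_process() {
    echo $(( $1 / $2 ))
}

# Demo with made-up sample data (not measured on any real node).
sample=$(mktemp)
printf 'MemTotal:       528000000 kB\nMemAvailable:   268435456 kB\n' > "$sample"
avail=$(mem_available_gib "$sample")
per_proc=$(gib_per_process "$avail" 256)
echo "available: ${avail} GiB, about ${per_proc} GiB per process with 256 processes"
rm -f "$sample"
```

    On the failing node, calling mem_available_gib without an argument reads the real /proc/meminfo. The result can then be compared with the memory estimate that Feko prints for the model.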

  • Krishna_21416
    Altair Community Member
    edited April 2024

    Frankly, I don't have any smaller models to try running Feko with on our machine. I was assuming that the failure I am observing is due to some misconfiguration or a missing parameter, since the same Feko binary runs fine on the other machines in our HPC cluster.

  • Torben Voigt
    Altair Employee
    edited April 2024

    Hi Krishna,

    How many parallel cores do you use? If it is a large MLFMM simulation, you could try with fewer cores to reduce the memory requirement a bit.

    (Just to test if this may be memory related)

    Best regards,
    Torben
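
    The suggestion above could be tried as a small sweep over core counts. The sketch below is a dry run that only prints the commands. The paths and flags are copied from the original post, the core counts are examples, and it assumes that machinefile is adjusted to match each count.

```shell
#!/bin/sh
# Dry run: print (do not execute) the runfeko command for several core counts.
# Paths and flags are taken from the original post; the core counts are examples.
RUNFEKO=/mnt/share/codes/feko/2022.2/altair/feko/bin/runfeko
MODEL=/mnt/share/benchmarks/feko/generic_sedan_parametric_1000.fek

build_cmd() {
    echo "$RUNFEKO $MODEL -np $1 --machines-file machinefile -d"
}

for np in 32 64 128 256; do
    build_cmd "$np"
done
```

    Running the printed commands one at a time and noting the first core count that fails would help show whether the process count, and with it the memory footprint, is the trigger.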

  • Krishna_21416
    Altair Community Member
    edited April 2024

    I have successfully run the same model on 128-core and 192-core machines, and now I am trying to run it on a 256-core machine with around 1 GB of L3 cache. Are you suggesting that the run could fail even with such a configuration?

  • Torben Voigt
    Altair Employee
    edited April 2024

    Well, I don't know the model or which solver is used. If MLFMM is used, you will probably see memory use increase with more cores (256 is a lot!). Why not just try with 32 cores to see if it works?

    Could you attach the model maybe?

    Best regards,
    Torben
