Error 158: Error encountered when accessing the scratch file

watkinrt (Altair Community Member)
edited October 2020 in Community Q&A

I am receiving the following error while trying to run a job on a Linux cluster:

 (Scratch disk space usage for starting iteration = 2704 MB)

 *** ERROR # 158 ***
 Error encountered when accessing the scratch file.
 error from subroutine xdslrs
 Solver error no. =   -702
 This may be caused by insufficient disk space or some other
 system resource related limitations.
 Try one or more of the following workarounds if you think otherwise
 and if this issue happens on Windows.
 - Resubmit a job.
 - Avoid writing scratch files in the drive where the Operating System is
   installed (start the job on other drive or use TMPDIR/-tmpdir options).
 - Disable real time protection at minimum for files with extension rs~ .
 - Use of environment variable OS_SCRATCH_EXT=txt may help.
 This error was detected in subroutine adjslvtm.

 *** ERROR # 5019 ***
 Specified temprature vectors (1 - 1) out of allowed range (1 - 0).
 This error occurs in module 'snsdrv'.

 ************************************************************************
 RESOURCE USAGE INFORMATION
 --------------------------
 MAXIMUM MEMORY USED                                      4985 MB
   IN ADDITION    177 MB MEMORY WAS ALLOCATED FOR TEMPORARY USE
   INCLUDING MEMORY FOR MUMPS                             2929 MB
 MAXIMUM DISK SPACE USED                                  5939 MB
   INCLUDING DISK SPACE FOR MUMPS                         3934 MB
 ************************************************************************

I've tried some of the troubleshooting suggestions in the error log without any luck:

  • I've pointed the scratch drive at a large file system (petabytes of storage)
  • I've set OS_SCRATCH_EXT=txt (applied roughly as in the sketch after this list)
  • Since this is a high-performance computing environment, I can't access any 'real time protection' options. However, I'm not aware of any virus protection running on the cluster, and OS_SCRATCH_EXT=txt should sidestep any virus-scan issues anyway, since I'd expect the system to let text files pass.
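
For reference, here's roughly how those two workarounds went into my job script (a sketch; the paths are placeholders for my environment):

 # $SCRATCH points at the large network file system (placeholder path)
 export OS_SCRATCH_EXT=txt        # write scratch files as .txt instead of .rs~
 export TMPDIR=$SCRATCH/os_tmp    # keep temporary files off the OS drive
 mkdir -p "$TMPDIR"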


I should note that I am trying to run this problem as a parallel job with the following command:

 $ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 -scr $SCRATCH -outfile $WORK <filename>.fem
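
For completeness, this is roughly how the job gets submitted (a sketch assuming a SLURM scheduler; the directives and paths are specific to my environment):

 #!/bin/bash
 #SBATCH --nodes=2               # two nodes requested
 #SBATCH --ntasks-per-node=24    # 48 MPI slots in total, though -np 2 only uses two of them
 $ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 \
     -scr $SCRATCH -outfile $WORK <filename>.fem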

Below are some other relevant notes:

  • If I try to run this job in serial (i.e., without the -mpi pl -ddm -np 2), I don't experience the above error, so this appears to be something that arises only when running MPI jobs.
  • I've also tried running this job with -mpi i, but my system doesn't seem to be set up for Intel MPI (it can't find the required .so files).
  • Cluster node information:
    • Dual Socket
    • Xeon E5-2690 v3 (Haswell) : 12 cores per socket (24 cores/node), 2.6 GHz
    • 64 GB DDR4-2133 (8 x 8GB dual rank x8 DIMMs)
    • Hyperthreading Enabled - 48 threads (logical CPUs) per node
  • The $SCRATCH drive I'm pointing to is a network drive. I tried running -scr slow=1,$SCRATCH, but I still get the same error. (A couple of quick sanity checks on that file system are sketched after this list.)
  • When I make the above DDM call, I have requested two nodes and a total of 48 MPI processes (although I'm not using them all)
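
For anyone debugging something similar, a couple of quick sanity checks on the scratch file system (standard Linux tools, nothing OptiStruct-specific; the test file name is arbitrary):

 df -hT "$SCRATCH"                                       # file system type and free space
 touch "$SCRATCH/test.rs~" && rm "$SCRATCH/test.rs~"     # can .rs~ files be created and deleted?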


Thoughts?

Answers

  • Q.Nguyen-Dai (Altair Community Member)
    edited March 2020
    • The scratch drive should be a fast local (SSD) disk for performance. Don't use a network drive!
    • Never use hyperthreading for FEA on an HPC cluster (a quick check is sketched after the note below)

    (Note: I work daily with an HPC Linux cluster for FEA)
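
    A quick way to check whether hyperthreading/SMT is actually active on a node (standard Linux tooling, nothing Altair-specific):

     lscpu | grep -E '^(Socket|Core|Thread)'    # "Thread(s) per core: 2" means hyperthreading is on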


  • watkinrt (Altair Community Member)
    edited March 2020

    Thanks for the response.

    • The scratch drive should be a fast local (SSD) disk for performance. Don't use a network drive!

    Unfortunately, I have very limited space allocated on my HPC local drive (1 GB), so for big jobs I have to offload the scratch work to a network scratch drive. Ideally I'd run everything in core, but sometimes that isn't an option either.

    • Never use hyperthreading for FEA on an HPC cluster

    Is that due to HPC limitations, or is it just not a good idea for OptiStruct in general? I ask because my cluster supposedly has hyperthreading available on the CPUs (not that I've been using it, though).
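
    For what it's worth, if avoiding hyperthreading is the recommendation, here's a sketch of how I could request physical cores only (assuming a SLURM scheduler, which is what my site uses):

     #SBATCH --hint=nomultithread    # use only one logical CPU per physical core (skip SMT siblings)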

  • Q.Nguyen-Dai (Altair Community Member)
    edited March 2020

    Your local drive on the HPC is only 1 GB???

    No FEA application can make use of Intel's 'HT' feature. Even worse, you lose performance.