Error 158: Error encountered when accessing the scratch file
I am receiving the following error while trying to run a job on a Linux cluster:
    (Scratch disk space usage for starting iteration = 2704 MB)

    *** ERROR # 158 ***
    Error encountered when accessing the scratch file.
    error from subroutine xdslrs
    Solver error no. = -702
    This may be caused by insufficient disk space or some other system
    resource related limitations.
    Try one or more of the following workarounds if you think otherwise
    and if this issue happens on Windows.
    - Resubmit a job.
    - Avoid writing scratch files in the drive where the Operating System
      is installed (start the job on other drive or use TMPDIR/-tmpdir options).
    - Disable real time protection at minimum for files with extension rs~ .
    - Use of environment variable OS_SCRATCH_EXT=txt may help.
    This error was detected in subroutine adjslvtm.

    *** ERROR # 5019 ***
    Specified temprature vectors (1 - 1) out of allowed range (1 - 0).
    This error occurs in module 'snsdrv'.

    ************************************************************************
    RESOURCE USAGE INFORMATION
    --------------------------
    MAXIMUM MEMORY USED                                        4985 MB
    IN ADDITION 177 MB MEMORY WAS ALLOCATED FOR TEMPORARY USE
    INCLUDING MEMORY FOR MUMPS                                 2929 MB
    MAXIMUM DISK SPACE USED                                    5939 MB
    INCLUDING DISK SPACE FOR MUMPS                             3934 MB
    ************************************************************************
I've tried some of the troubleshooting suggestions from the error log without any luck:
- I've pointed the scratch drive at a large file system (petabytes of storage).
- I've set OS_SCRATCH_EXT=txt (both settings are sketched after this list).
- Since this is a high-performance computing environment, I don't have access to any 'real time protection' settings. However, I'm not aware of any virus protection running on the cluster, and OS_SCRATCH_EXT=txt should sidestep virus-scanning issues anyway, since I'd expect the scanner to let text files pass.
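For reference, this is roughly how those two settings were applied; the scratch path below is a placeholder for the cluster's actual scratch mount, not the real path:

    # Scratch pointed at the large (petabyte-scale) network file system;
    # the path below is a placeholder, not the real mount point.
    export SCRATCH=/path/to/network/scratch

    # Workaround suggested in the error log: write scratch files with a
    # .txt extension so any real-time protection should let them pass.
    export OS_SCRATCH_EXT=txt

    # $SCRATCH is then passed to the solver via -scr (or -tmpdir),
    # as in the run command below.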
I should note that I am trying to run this problem as a parallel job with the following command:
$ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 -scr $SCRATCH -outfile $WORK <filename>.fem
Below are some other relevant notes:
- If I try to run this job in serial (i.e., without -mpi pl -ddm -np 2), I don't hit the above error, so this appears to be something that arises only when running MPI jobs (see the comparison sketch after this list).
- I've tried running this job with -mpi i, but my system doesn't seem to be set up for Intel MPI (it can't find the required .so files).
- Cluster node information:
  - Dual socket
  - Xeon E5-2690 v3 (Haswell): 12 cores per socket (24 cores/node), 2.6 GHz
  - 64 GB DDR4-2133 (8 x 8 GB dual-rank x8 DIMMs)
  - Hyperthreading enabled: 48 threads (logical CPUs) per node
- The $SCRATCH drive I'm pointing to is a network drive. I tried running with -scr slow=1,$SCRATCH, but I still get the same error.
- When I make the DDM call above, I have requested two nodes and a total of 48 MPI processes (although I'm not using them all).
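To make the serial/parallel comparison concrete, here are the invocations side by side. These are sketches only; model.fem stands in for the real input file name:

    # Serial run -- finishes without Error 158:
    $ALTAIR_HOME/scripts/invoke/init_solver.sh -nt 8 \
        -scr $SCRATCH -outfile $WORK model.fem

    # Parallel DDM run -- fails with Error 158:
    $ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 \
        -scr $SCRATCH -outfile $WORK model.fem

    # Same DDM run with the network scratch flagged as slow -- still fails:
    $ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 \
        -scr slow=1,$SCRATCH -outfile $WORK model.fem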
Thoughts?
Answers
- The scratch drive should normally be a fast local (SSD) disk for performance. Don't use a network drive!
- Never run with hyperthreading enabled for FEA on an HPC cluster.
(Note: I work with HPC Linux clusters for FEA daily. A rough sketch of both points applied to your command is below.)
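This is only a sketch, not a tested command: the node-local scratch path is a hypothetical example and depends on what local storage your nodes actually provide, and the core counts come from the node description above.

    # First point applied: scratch moved from the network drive to
    # node-local disk (/tmp/$USER/... is a hypothetical example path).
    $ALTAIR_HOME/scripts/invoke/init_solver.sh -mpi pl -ddm -np 2 -nt 8 \
        -scr /tmp/$USER/os_scratch -outfile $WORK model.fem

    # Second point: keep np x nt within the 24 physical cores per node
    # (2 x 8 = 16 here) and ignore the 48 hyperthreaded logical CPUs
    # when sizing -np and -nt.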
Thanks for the response.
Altair Forum User said: The scratch drive should normally be a fast local (SSD) disk for performance. Don't use a network drive!
Unfortunately, I have very limited space allocated on the local drive of my HPC nodes (1 GB), so for big jobs I have to offload the scratch work to a network scratch drive. Ideally, I could run everything in core, but sometimes that isn't an option either.
Altair Forum User said: Never run with hyperthreading enabled for FEA on an HPC cluster.
Is that due to an HPC limitation, or is it just a bad idea for OptiStruct in general? I ask because my cluster supposedly has hyperthreading available on the CPUs (not that I've been using it).
Your local drive on the HPC nodes is only 1 GB???
No FEA application benefits from Intel's Hyper-Threading ('HT') feature; even worse, you lose performance. A quick way to confirm what the nodes actually expose is sketched below.
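These are standard Linux commands; the lscpu field names are standard, though the exact output formatting can vary by distribution:

    # Logical CPUs visible to the OS (48 per node with HT enabled):
    nproc

    # Physical layout: "Thread(s) per core: 2" means HT is on, and
    # Socket(s) x Core(s) per socket gives the 24 physical cores that
    # -np and -nt should be sized against.
    lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'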