Jobs being killed with this in the .e file: "bash: line 1: 108753 Killed /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC"
I submit 32 simple hello world batch jobs and 1/3 or more will die with this in the .e log:
bash: line 1: 108753 Killed /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC
All 32 jobs are identical:
#!/bin/bash
#PBS -S /bin/bash
#PBS -l select=1:ncpus=4:mem=12gb:host=p0314
#PBS -q admin
#PBS -l place=shared
#PBS -N test
#PBS -l walltime=01:12:01
echo "Hello World"
echo "Sleeping for 1 minute"
sleep 60
echo "Done"
Any help would be appreciated.
Thanks,
Matthew
Answers
-
Hi Matthew,
It would be interesting to see the output of the tracejob and qstat commands for a job which has failed, as this will provide some more information.
Please try running the tracejob and qstat commands below on the pbs_server host, and run tracejob again on the node p0314. Remember to replace <jobid> with the jobid of a failed job related to this issue.
tracejob <jobid>
qstat -xfw <jobid>
It may also be a good idea to monitor whether the node p0314 is running out of memory after you have submitted 32 of these test jobs.
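As a rough sketch, the diagnostics above could be gathered in one go. The job id 964223 here is only an example taken from the error message in the original post; replace it with the id of an actual failed job (the `|| true` guards are just so the sketch keeps going on a machine where a PBS command is missing):

```shell
#!/bin/sh
# Sketch: collect PBS diagnostics for one failed job.
JOBID=964223   # example id from the error message; substitute a real failed job

# On the pbs_server host:
tracejob "$JOBID" > "tracejob_${JOBID}.log" 2>&1 || true
qstat -xfw "$JOBID" > "qstat_${JOBID}.log" 2>&1 || true

# On node p0314, spot-check free memory while the 32 jobs are running:
free -m
```

Running `free -m` repeatedly (or under `watch`) while the jobs run should make any memory exhaustion on p0314 obvious.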
-
Hi Matthew,
Does the cluster have users created by the cluster manager, or is it integrated with an external AD (SSSD)?
Please check whether
1. getent passwd USERNAME # run this on the compute node p0314 and see whether it returns all of the account information.
2. a shell is actually started for the submitted script on the compute node where the job is failing. Try an interactive submission: qsub -l select=1:ncpus=4:mem=12gb:host=p0314 -I (then hit enter) and run the job's commands by hand.
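For step 1, a minimal sketch of the check to run on p0314. "root" is used here only because that account always exists; substitute the submitting user's name:

```shell
#!/bin/sh
# Sketch: verify the compute node can resolve the submitting user's account.
USER_TO_CHECK=root   # placeholder; replace with the real submitting user

# getent consults the same NSS sources (local files, SSSD/AD, ...) the node uses.
entry=$(getent passwd "$USER_TO_CHECK")
echo "$entry"

# A complete passwd entry has 7 colon-separated fields:
# name:password:uid:gid:gecos:home:shell
nfields=$(printf '%s' "$entry" | awk -F: '{print NF}')
echo "fields: $nfields"
```

If the entry is empty or missing fields (home directory, shell), the job's login shell can fail to start, which would match the symptom.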
Thank you
-
Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups. PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.
-
Matthew Grey said:
Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups. PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.
Thank you Matthew for this information.
-
Matthew Grey said:
Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups. PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.
Hello Matthew,
I am wondering if you could say which external application you were having this issue with. I am having the same problem with cgroups getting mixed together, and I was hoping you had a solution.
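For anyone else chasing this symptom, a sketch of how to see which cgroup a process actually belongs to. The /proc/<pid>/cgroup interface is standard Linux; the exact path of a PBS job's cgroup under /sys/fs/cgroup varies by setup, so the commented path below is only a placeholder:

```shell
#!/bin/sh
# Sketch: inspect a process's cgroup membership.
PID=$$   # demo: inspect this shell itself; in practice use a job process PID

# Every Linux process lists its cgroup membership here.
cat "/proc/$PID/cgroup"

# If two PBS jobs share a cgroup, that cgroup's process list will contain
# PIDs from both jobs. Once you know your job's cgroup path, list its
# members with something like:
#   cat /sys/fs/cgroup/<pbs-job-cgroup-path>/cgroup.procs
```

Comparing cgroup.procs for two concurrently running jobs would show directly whether PIDs are leaking between them.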