jobs being killed with this in the .e file "bash: line 1: 108753 Killed /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC"

Matthew Grey
Altair Community Member
edited September 9 in Community Q&A

I submit 32 simple hello-world batch jobs, and a third or more of them die with this in the .e log:
bash: line 1: 108753 Killed                  /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC

 

All 32 jobs are identical

#!/bin/bash


#PBS -S /bin/bash
#PBS -l select=1:ncpus=4:mem=12gb:host=p0314
#PBS -q admin 
#PBS -l place=shared
#PBS -N test
#PBS -l walltime=01:12:01

echo "Hello World"
echo "Sleeping for 1 minute"
sleep 60 
echo "Done"

Any help would be appreciated.

Thanks,
Matthew

Answers

  • Jake Goldingay
    Altair Employee
    edited November 2023

    Hi Matthew,

    It would be interesting to see the output of the tracejob and qstat commands for a job that has failed, as this will provide more information.

    Please run the tracejob and qstat commands below on the pbs_server host, and run another tracejob on the node p0314. Remember to replace <jobid> with the job ID of a failed job related to this issue.

    tracejob <jobid>

    qstat -xfw <jobid>

    It may also be a good idea to monitor the node p0314 after you have submitted all 32 of these test jobs, to check that it is not running out of memory; one way to do that is sketched below.
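
    A minimal sketch of that memory check, assuming standard Linux tools (watch, free, dmesg) are available on p0314:

    # Watch overall memory use on p0314 while the 32 test jobs run (refresh every 5 seconds)
    watch -n 5 free -m

    # Check the kernel log for the OOM killer, which delivers exactly the kind of SIGKILL seen in the .e file
    dmesg -T | grep -iE "out of memory|oom"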

  • Adarsh_20887
    Altair Employee
    edited December 2023

    Hi Matthew,

    Does the cluster have users created by the cluster manager, or is it integrated with an external AD (SSSD)?

    Please check the following:

    1. Run getent passwd USERNAME on the compute node p0314 and check that it returns all of the user's information.

    2. Check that a shell starts correctly when a script is submitted to the compute node where the job is failing. Try an interactive submission (qsub -l select=1:ncpus=4:mem=12gb:host=p0314 -I, then hit Enter); we can then test running the job as shown in the sketch below.
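
    A minimal sketch of those two checks, assuming USERNAME is a placeholder for the submitting user and reusing the resource request from the failing jobs:

    # 1. On the compute node p0314, confirm the user account resolves fully:
    getent passwd USERNAME

    # 2. Request an interactive job on the same node:
    qsub -q admin -l select=1:ncpus=4:mem=12gb:host=p0314 -l place=shared -I

    # Once the interactive shell opens on p0314, run the same steps as the batch script:
    echo "Hello World"
    sleep 60
    echo "Done"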

     

    Thank you

  • Matthew Grey
    Altair Community Member
    edited December 2023

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.
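
    For reference, a rough way to spot this on the execution node is to list which PIDs sit in each job's cgroup; the path below is an assumption and depends on how the pbs_cgroups hook is configured and whether the node uses cgroup v1 or v2:

    # List the PIDs assigned to each PBS job cgroup (adjust the path to match
    # the hierarchy the cgroup hook creates on your nodes):
    for d in /sys/fs/cgroup/memory/pbspro/*/; do
        echo "== $d =="
        cat "${d}cgroup.procs"
    done
    # A PID from one job appearing under another job's directory reproduces the
    # symptom above: when that other job ends and its cgroup is torn down,
    # every PID inside it is killed.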

  • Adarsh_20887
    Altair Employee
    edited December 2023

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.

    Thank you Matthew for this information.

  • Aaron Janssen
    Altair Community Member
    edited September 10

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.

    Hello Matthew,

    I am wondering if you could say which external application you were having this issue with. I am having the same issue with cgroups getting mixed together, and I was hoping you had a solution.