jobs being killed with this in the .e file "bash: line 1: 108753 Killed /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC"

Matthew Grey
Altair Community Member
edited September 9 in Community Q&A

I submit 32 simple hello-world batch jobs, and a third or more of them die with this in the .e log:
bash: line 1: 108753 Killed                  /cm/local/apps/pbspro/var/spool/mom_priv/jobs/964223.polaris.SC

 

All 32 jobs are identical

#!/bin/bash


#PBS -S /bin/bash
#PBS -l select=1:ncpus=4:mem=12gb:host=p0314
#PBS -q admin 
#PBS -l place=shared
#PBS -N test
#PBS -l walltime=01:12:01

echo "Hello World"
echo "Sleeping for 1 minute"
sleep 60 
echo "Done"

Any help would be appreciated.

Thanks,
Matthew

Answers

  • Jake Goldingay
    Altair Employee
    edited November 2023

    Hi Matthew,

    It would be interesting to see the output of the tracejob and qstat commands for a job that has failed, as this will provide more information.

    Please run the tracejob and qstat commands below on the pbs_server host, and run another tracejob on the node p0314. Remember to replace <jobid> with the job ID of a failed job related to this issue.

    tracejob <jobid>

    qstat -xfw <jobid>

    It may also be a good idea to monitor the node p0314 after you have submitted all 32 of these test jobs, to check that it is not running out of memory; one way to do that is sketched below.
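
    A minimal sketch of that memory check, assuming standard Linux tools (watch, free, dmesg) are available on p0314:

    # Watch overall memory use on p0314 while the 32 test jobs run (refresh every 5 seconds)
    watch -n 5 free -m

    # Check the kernel log for the OOM killer, which delivers exactly the kind of SIGKILL seen in the .e file
    dmesg -T | grep -iE "out of memory|oom"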

  • Adarsh_20887
    Altair Employee
    edited December 2023

    Hi Matthew,

    Does the cluster have users created by the cluster manager, or is it integrated with an external AD (SSSD)?

    Please check the following:

    1. Run getent passwd USERNAME on the compute node p0314 and check that it returns all of the user's information.

    2. Check that a shell starts correctly when a script is submitted to the compute node where the job is failing. Try an interactive submission (qsub -l select=1:ncpus=4:mem=12gb:host=p0314 -I, then hit Enter); we can then test running the job as shown in the sketch below.
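
    A minimal sketch of those two checks, assuming USERNAME is a placeholder for the submitting user and reusing the resource request from the failing jobs:

    # 1. On the compute node p0314, confirm the user account resolves fully:
    getent passwd USERNAME

    # 2. Request an interactive job on the same node:
    qsub -q admin -l select=1:ncpus=4:mem=12gb:host=p0314 -l place=shared -I

    # Once the interactive shell opens on p0314, run the same steps as the batch script:
    echo "Hello World"
    sleep 60
    echo "Done"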

     

    Thank you

  • Matthew Grey
    Altair Community Member
    edited December 2023

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.
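
    For reference, a rough way to spot this on the execution node is to list which PIDs sit in each job's cgroup; the path below is an assumption and depends on how the pbs_cgroups hook is configured and whether the node uses cgroup v1 or v2:

    # List the PIDs assigned to each PBS job cgroup (adjust the path to match
    # the hierarchy the cgroup hook creates on your nodes):
    for d in /sys/fs/cgroup/memory/pbspro/*/; do
        echo "== $d =="
        cat "${d}cgroup.procs"
    done
    # A PID from one job appearing under another job's directory reproduces the
    # symptom above: when that other job ends and its cgroup is torn down,
    # every PID inside it is killed.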

  • Adarsh_20887
    Altair Employee
    edited December 2023

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.

    Thank you Matthew for this information.

  • Aaron Janssen
    Altair Community Member
    edited September 10

    Hello Adarsh, I apologize for not posting earlier, but this issue has been resolved. It was caused by an external application interfering with cgroups: PIDs from one job's cgroup were getting mixed in with another job's. When the second job finished, it would kill both jobs.

    Hello Matthew,

    I am wondering if you could say which external application you were having this issue with. I am having the same issue with cgroups getting mixed together, and I was hoping you had a solution.