Jobs not running on AGE, --------------STOP-SCHEDULER-RUN-------------
Hello,
We are running Altair Grid Engine v8.6.7 on CentOS Linux 7.9.
It has been working fine for a long time.
We had to reboot our master node, which runs the sge_qmaster service.
Since then, jobs have been unable to enter the running state.
I also rebooted one execution node, but nothing changed.
Here is what I see on the master node:
[root@OSAKA-NEW ~]$ systemctl status ugemaster
● ugemaster.service - Univa Grid Engine Qmaster
Loaded: loaded (/usr/lib/systemd/system/ugemaster.service; enabled; vendor preset: disabled)
Active: active (running) since lun. 2023-12-18 13:27:33 CET; 2 days ago
Main PID: 2255 (sge_qmaster)
CGroup: /system.slice/ugemaster.service
└─2255 /usr/uge/bin/lx-amd64/sge_qmaster
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: Q:310, AQ:310 J:213(213), H:72(72), C:99, A:29, D:16, P:34, CKPT:1, US:19, PR:0, RQS:2, AR:0, S:nd:0/lf:0
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: --------------STOP-SCHEDULER-RUN-------------
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: Q:310, AQ:310 J:213(213), H:72(72), C:99, A:29, D:16, P:34, CKPT:1, US:19, PR:0, RQS:2, AR:0, S:nd:0/lf:0
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: --------------STOP-SCHEDULER-RUN-------------
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: Q:310, AQ:310 J:213(213), H:72(72), C:99, A:29, D:16, P:34, CKPT:1, US:19, PR:0, RQS:2, AR:0, S:nd:0/lf:0
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: --------------STOP-SCHEDULER-RUN-------------
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: Q:310, AQ:310 J:213(213), H:72(72), C:99, A:29, D:16, P:34, CKPT:1, US:19, PR:0, RQS:2, AR:0, S:nd:0/lf:0
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: --------------STOP-SCHEDULER-RUN-------------
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: Q:310, AQ:310 J:213(213), H:72(72), C:99, A:29, D:16, P:34, CKPT:1, US:19, PR:0, RQS:2, AR:0, S:nd:0/lf:0
déc. 20 20:50:35 noeud01-osaka.ucp sge_qmaster[2255]: --------------STOP-SCHEDULER-RUN-------------
I don't understand the message "STOP-SCHEDULER-RUN".
And here is what I see on the rebooted execution node:
[root@hdr09-osaka ~]# systemctl status uge_execd
● uge_execd.service - Univa GridEngine execution daemon
Loaded: loaded (/usr/lib/systemd/system/uge_execd.service; enabled; vendor preset: disabled)
Active: active (running) since mer. 2023-12-20 20:25:00 CET; 27min ago
Main PID: 1871 (sge_execd)
Tasks: 7
Memory: 37.9M
CGroup: /system.slice/uge_execd.service
├─1871 /usr/uge/bin/lx-amd64/sge_execd
└─2152 /bin/bash /usr/uge/osaka/common/load_sensor.sh
déc. 20 20:25:00 hdr09-osaka.ucp sge_execd[1871]: using "none" for default_jc
déc. 20 20:25:00 hdr09-osaka.ucp sge_execd[1871]: using "cgroup_path=/sys/fs/cgroup cpuset=true mount=true subdir_name=UGE freezer=true freeze_pe_tasks=true killing=true forced_numa=true...groups_params
déc. 20 20:25:00 hdr09-osaka.ucp sge_execd[1871]: using "none" for port_range
déc. 20 20:25:00 hdr09-osaka.ucp sge_execd[1871]: registered at qmaster host "noeud01-osaka.ucp"
déc. 20 20:25:00 hdr09-osaka.ucp sge_execd[1871]: starting up Univa Grid Engine UGE 8.6.7 (lx-amd64)
déc. 20 20:25:01 hdr09-osaka.ucp sge_execd[1871]: starting with processing
déc. 20 20:25:01 hdr09-osaka.ucp sge_execd[1871]: gid range observation is enabled
déc. 20 20:25:01 hdr09-osaka.ucp sge_execd[1871]: using "5" for "pdc_interval"
déc. 20 20:25:01 hdr09-osaka.ucp sge_execd[1871]: using "120" for "pdc_cache_update_timeout"
déc. 20 20:25:01 hdr09-osaka.ucp sge_execd[1871]: blocked additional group ids: 0/101
Hint: Some lines were ellipsized, use -l to show in full.
Here is the scheduling information for a queued (qw) job that fails to enter the running state:
[root@OSAKA-NEW ~]$ qstat -F -j 647986
==============================================================
job_number: 647986
jclass: NONE
exec_file: job_scripts/647986
submission_time: 12/20/2023 20:15:42.592
owner: ycostes
uid: 52055
group: cdc
gid: 152
supplementary group: cdc, chim
sge_o_home: /u/cdc/ycostes
sge_o_log_name: ycostes
sge_o_path: /usr/local/modules/5.0.1/bin:/usr/local/matlab-R2023b/bin:/usr/local/osaka_help//2023c/bin:/usr/local/intel/advisor_2020.0.0.604394/bin64:/usr/local/intel/vtune_profiler_2020.0.0.605129/bin64:/usr/local/intel/inspector_2020.0.0.603904/bin64:/usr/local/intel/itac/2020.0.015/intel64/bin:/usr/local/intel/clck/2019.6/bin/intel64:/usr/local/intel/compilers_and_libraries_2020.0.166/linux/bin/intel64:/usr/local/intel/compilers_and_libraries_2020.0.166/linux/bin:/usr/local/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/libfabric/bin:/usr/local/intel/compilers_and_libraries_2020.0.166/linux/mpi/intel64/bin:/usr/local/intel/debugger_2020/gdb/intel64/bin:/usr/uge/bin/lx-amd64:/usr/local/miniconda/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/intel/parallel_studio_xe_2020.0.088/bin:/opt/dell/srvadmin/bin
sge_o_shell: /bin/bash
sge_o_workdir: /u/cdc/ycostes
sge_o_host: frontal01-osaka
account: sge
cwd: /u/cdc/ycostes
merge: y
hard resource_list: m_mem_free=1G,partition=HEPIQ
mail_options: abes
mail_list: yann.costes@cyu.fr
notify: FALSE
job_name: test
priority: 0
jobshare: 0
env_list:
script_file: test-long1.job
parallel environment: openmp range: 2
department: cdc
binding: set linear_per_task:1
mbind: cores:strict
submit_cmd: qsub test-long1.job
category_id: 46
request_dispatch_info: FALSE
scheduling info: queue instance "seq_long@noeud39-osaka.ucp" dropped because host slots are full (similar reason #1)
queue instance "seq_medium@noeud39-osaka.ucp" dropped because host slots are full (similar reason #2)
queue instance "seq_short@noeud39-osaka.ucp" dropped because host slots are full (similar reason #3)
queue instance "test@noeud39-osaka.ucp" dropped because host slots are full (similar reason #4)
queue instance "test2@noeud39-osaka.ucp" dropped because host slots are full (similar reason #5)
queue instance "para_long@noeud39-osaka.ucp" dropped because host slots are full (similar reason #6)
queue instance "seq_long@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.99 (similar reason #1)
queue instance "seq_medium@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.99 (similar reason #2)
queue instance "seq_short@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.99 (similar reason #3)
queue instance "test@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.99 (similar reason #4)
queue instance "test2@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.99 (similar reason #5)
queue instance "para_long@noeud60-osaka.ucp" dropped because it is overloaded: np_load_short=1.100556 (no load adjustment) >= 0.95 (similar reason #6)
queue instance "seq_long@gpu01-osaka.ucp" dropped because it is disabled (similar reason #1)
queue instance "seq_long@noeud26-osaka.ucp" dropped because it is disabled (similar reason #2)
queue instance "seq_long@noeud25-osaka.ucp" dropped because it is disabled (similar reason #3)
queue instance "seq_medium@gpu01-osaka.ucp" dropped because it is disabled (similar reason #4)
queue instance "seq_medium@noeud26-osaka.ucp" dropped because it is disabled (similar reason #5)
queue instance "seq_medium@noeud25-osaka.ucp" dropped because it is disabled (similar reason #6)
queue instance "seq_short@gpu01-osaka.ucp" dropped because it is disabled (similar reason #7)
queue instance "seq_short@noeud26-osaka.ucp" dropped because it is disabled (similar reason #8)
queue instance "seq_short@noeud25-osaka.ucp" dropped because it is disabled (similar reason #9)
queue instance "test@noeud67-osaka.ucp" dropped because it is disabled (similar reason #10)
queue instance "test@noeud77-osaka.ucp" dropped because it is disabled (similar reason #11)
queue instance "test@gpu01-osaka.ucp" dropped because it is disabled (similar reason #12)
queue instance "test@noeud26-osaka.ucp" dropped because it is disabled (similar reason #13)
queue instance "test@noeud31-osaka.ucp" dropped because it is disabled (similar reason #14)
queue instance "test@noeud46-osaka.ucp" dropped because it is disabled (similar reason #15)
queue instance "test@noeud25-osaka.ucp" dropped because it is disabled (similar reason #16)
queue instance "test@noeud75-osaka.ucp" dropped because it is disabled (similar reason #17)
queue instance "test@noeud74-osaka.ucp" dropped because it is disabled (similar reason #18)
queue instance "test@noeud49-osaka.ucp" dropped because it is disabled (similar reason #19)
queue instance "test@noeud44-osaka.ucp" dropped because it is disabled (similar reason #20)
queue instance "test@noeud69-osaka.ucp" dropped because it is disabled (similar reason #21)
queue instance "test@noeud33-osaka.ucp" dropped because it is disabled (similar reason #22)
queue instance "test@noeud58-osaka.ucp" dropped because it is disabled (similar reason #23)
queue instance "test@noeud59-osaka.ucp" dropped because it is disabled (similar reason #24)
queue instance "test@noeud51-osaka.ucp" dropped because it is disabled (similar reason #25)
queue instance "test@noeud37-osaka.ucp" dropped because it is disabled (similar reason #26)
queue instance "test@noeud63-osaka.ucp" dropped because it is disabled (similar reason #27)
queue instance "test@noeud45-osaka.ucp" dropped because it is disabled (similar reason #28)
queue instance "test@noeud47-osaka.ucp" dropped because it is disabled (similar reason #29)
queue instance "test@noeud30-osaka.ucp" dropped because it is disabled (similar reason #30)
queue instance "test@noeud61-osaka.ucp" dropped because it is disabled (similar reason #31)
queue instance "test@noeud66-osaka.ucp" dropped because it is disabled (similar reason #32)
queue instance "test@noeud28-osaka.ucp" dropped because it is disabled (similar reason #33)
queue instance "test@noeud29-osaka.ucp" dropped because it is disabled (similar reason #34)
queue instance "test@noeud27-osaka.ucp" dropped because it is disabled (similar reason #35)
queue instance "test@noeud36-osaka.ucp" dropped because it is disabled (similar reason #36)
queue instance "test@noeud48-osaka.ucp" dropped because it is disabled (similar reason #37)
queue instance "test@noeud50-osaka.ucp" dropped because it is disabled (similar reason #38)
queue instance "test@noeud57-osaka.ucp" dropped because it is disabled (similar reason #39)
queue instance "test@noeud56-osaka.ucp" dropped because it is disabled (similar reason #40)
queue instance "test@noeud72-osaka.ucp" dropped because it is disabled (similar reason #41)
queue instance "test@noeud62-osaka.ucp" dropped because it is disabled (similar reason #42)
queue instance "test@noeud79-osaka.ucp" dropped because it is disabled (similar reason #43)
queue instance "test@noeud35-osaka.ucp" dropped because it is disabled (similar reason #44)
queue instance "test@noeud83-osaka.ucp" dropped because it is disabled (similar reason #45)
queue instance "test@noeud68-osaka.ucp" dropped because it is disabled (similar reason #46)
queue instance "test@noeud81-osaka.ucp" dropped because it is disabled (similar reason #47)
queue instance "test@noeud52-osaka.ucp" dropped because it is disabled (similar reason #48)
queue instance "test@noeud82-osaka.ucp" dropped because it is disabled (similar reason #49)
queue instance "test@noeud32-osaka.ucp" dropped because it is disabled (similar reason #50)
queue instance "test@noeud80-osaka.ucp" dropped because it is disabled (similar reason #51)
queue instance "test@noeud53-osaka.ucp" dropped because it is disabled (similar reason #52)
queue instance "test@noeud84-osaka.ucp" dropped because it is disabled (similar reason #53)
queue instance "test@noeud87-osaka.ucp" dropped because it is disabled (similar reason #54)
queue instance "test@noeud88-osaka.ucp" dropped because it is disabled (similar reason #55)
queue instance "test@noeud85-osaka.ucp" dropped because it is disabled (similar reason #56)
queue instance "test@noeud64-osaka.ucp" dropped because it is disabled (similar reason #57)
queue instance "test@noeud73-osaka.ucp" dropped because it is disabled (similar reason #58)
queue instance "test@noeud70-osaka.ucp" dropped because it is disabled (similar reason #59)
queue instance "test@noeud71-osaka.ucp" dropped because it is disabled (similar reason #60)
queue instance "test@noeud76-osaka.ucp" dropped because it is disabled (similar reason #61)
queue instance "test@noeud43-osaka.ucp" dropped because it is disabled (similar reason #62)
queue instance "test@noeud65-osaka.ucp" dropped because it is disabled (similar reason #63)
queue instance "test@noeud86-osaka.ucp" dropped because it is disabled (similar reason #64)
queue instance "test@noeud78-osaka.ucp" dropped because it is disabled (similar reason #65)
queue instance "test2@noeud67-osaka.ucp" dropped because it is disabled (similar reason #66)
queue instance "test2@noeud77-osaka.ucp" dropped because it is disabled (similar reason #67)
queue instance "test2@gpu01-osaka.ucp" dropped because it is disabled (similar reason #68)
queue instance "test2@noeud26-osaka.ucp" dropped because it is disabled (similar reason #69)
queue instance "test2@noeud31-osaka.ucp" dropped because it is disabled (similar reason #70)
queue instance "test2@noeud46-osaka.ucp" dropped because it is disabled (similar reason #71)
queue instance "test2@noeud25-osaka.ucp" dropped because it is disabled (similar reason #72)
queue instance "test2@noeud75-osaka.ucp" dropped because it is disabled (similar reason #73)
queue instance "test2@noeud74-osaka.ucp" dropped because it is disabled (similar reason #74)
queue instance "test2@noeud49-osaka.ucp" dropped because it is disabled (similar reason #75)
queue instance "test2@noeud44-osaka.ucp" dropped because it is disabled (similar reason #76)
queue instance "test2@noeud69-osaka.ucp" dropped because it is disabled (similar reason #77)
queue instance "test2@noeud33-osaka.ucp" dropped because it is disabled (similar reason #78)
queue instance "test2@noeud58-osaka.ucp" dropped because it is disabled (similar reason #79)
queue instance "test2@noeud59-osaka.ucp" dropped because it is disabled (similar reason #80)
queue instance "test2@noeud51-osaka.ucp" dropped because it is disabled (similar reason #81)
queue instance "test2@noeud37-osaka.ucp" dropped because it is disabled (similar reason #82)
queue instance "test2@noeud63-osaka.ucp" dropped because it is disabled (similar reason #83)
queue instance "test2@noeud45-osaka.ucp" dropped because it is disabled (similar reason #84)
queue instance "test2@noeud47-osaka.ucp" dropped because it is disabled (similar reason #85)
queue instance "test2@noeud30-osaka.ucp" dropped because it is disabled (similar reason #86)
queue instance "test2@noeud61-osaka.ucp" dropped because it is disabled (similar reason #87)
queue instance "test2@noeud66-osaka.ucp" dropped because it is disabled (similar reason #88)
queue instance "test2@noeud28-osaka.ucp" dropped because it is disabled (similar reason #89)
queue instance "test2@noeud29-osaka.ucp" dropped because it is disabled (similar reason #90)
queue instance "test2@noeud27-osaka.ucp" dropped because it is disabled (similar reason #91)
queue instance "test2@noeud36-osaka.ucp" dropped because it is disabled (similar reason #92)
queue instance "test2@noeud48-osaka.ucp" dropped because it is disabled (similar reason #93)
queue instance "test2@noeud50-osaka.ucp" dropped because it is disabled (similar reason #94)
queue instance "test2@noeud57-osaka.ucp" dropped because it is disabled (similar reason #95)
queue instance "test2@noeud56-osaka.ucp" dropped because it is disabled (similar reason #96)
queue instance "test2@noeud72-osaka.ucp" dropped because it is disabled (similar reason #97)
queue instance "test2@noeud62-osaka.ucp" dropped because it is disabled (similar reason #98)
queue instance "test2@noeud79-osaka.ucp" dropped because it is disabled (similar reason #99)
queue instance "test2@noeud35-osaka.ucp" dropped because it is disabled (similar reason #100)
queue instance "test2@noeud83-osaka.ucp" dropped because it is disabled (similar reason #101)
queue instance "test2@noeud68-osaka.ucp" dropped because it is disabled (similar reason #102)
queue instance "test2@noeud81-osaka.ucp" dropped because it is disabled (similar reason #103)
queue instance "test2@noeud52-osaka.ucp" dropped because it is disabled (similar reason #104)
queue instance "test2@noeud82-osaka.ucp" dropped because it is disabled (similar reason #105)
queue instance "test2@noeud32-osaka.ucp" dropped because it is disabled (similar reason #106)
queue instance "test2@noeud80-osaka.ucp" dropped because it is disabled (similar reason #107)
queue instance "test2@noeud53-osaka.ucp" dropped because it is disabled (similar reason #108)
queue instance "test2@noeud84-osaka.ucp" dropped because it is disabled (similar reason #109)
queue instance "test2@noeud87-osaka.ucp" dropped because it is disabled (similar reason #110)
queue instance "test2@noeud88-osaka.ucp" dropped because it is disabled (similar reason #111)
queue instance "test2@noeud85-osaka.ucp" dropped because it is disabled (similar reason #112)
queue instance "test2@noeud64-osaka.ucp" dropped because it is disabled (similar reason #113)
queue instance "test2@noeud73-osaka.ucp" dropped because it is disabled (similar reason #114)
queue instance "test2@noeud70-osaka.ucp" dropped because it is disabled (similar reason #115)
queue instance "test2@noeud71-osaka.ucp" dropped because it is disabled (similar reason #116)
queue instance "test2@noeud76-osaka.ucp" dropped because it is disabled (similar reason #117)
queue instance "test2@noeud43-osaka.ucp" dropped because it is disabled (similar reason #118)
queue instance "test2@noeud65-osaka.ucp" dropped because it is disabled (similar reason #119)
queue instance "test2@noeud86-osaka.ucp" dropped because it is disabled (similar reason #120)
queue instance "test2@noeud78-osaka.ucp" dropped because it is disabled (similar reason #121)
cannot run in PE "openmp" because resource requirements of the job cannot be fulfilled
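Since the scheduling info above is long, here is a minimal Python sketch that tallies the "dropped because" reasons so the overall picture is easier to see. The script name summarize_drops.py and the function name summarize_drop_reasons are hypothetical, and the regular expression only assumes the line format shown in the output above; it is not an official Grid Engine tool.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   queue instance "test@noeud39-osaka.ucp" dropped because host slots are full (similar reason #4)
# The trailing "(similar reason #N)" counter is optional and is not part of the reason.
DROP_RE = re.compile(
    r'queue instance "(?P<queue>[^@"]+)@(?P<host>[^"]+)" dropped because '
    r'(?P<reason>.+?)(?: \(similar reason #\d+\))?\s*$'
)


def summarize_drop_reasons(qstat_output):
    """Count how many queue instances were dropped for each reason."""
    counts = Counter()
    for line in qstat_output.splitlines():
        match = DROP_RE.search(line)
        if match:
            # Keep only the text before ':' so that differing load values
            # (e.g. np_load_short=...) do not split one reason into many.
            reason = match.group("reason").split(":", 1)[0]
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    # Read the qstat output from standard input and print one line per reason.
    for reason, count in summarize_drop_reasons(sys.stdin.read()).most_common():
        print(f"{count:4d}  {reason}")
```

It could be run as `qstat -F -j 647986 | python3 summarize_drops.py`. On the output above it should report 121 queue instances dropped as disabled, 6 because host slots are full, and 6 as overloaded, which matches the "similar reason" counters.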
Thanks in advance for your help.
Best regards,
Y. Costes