PBS Pro State down

Takayoshi Toyama Altair Community Member

I ran PBS Pro for the first time in a long time, and the node ended up in state:down. (Currently, restarting PBS returns it to state:free and it can be used.)
Based on the following logs, I suspect a problem with the communication between the server and MoM.

mom_log:
01/25/2024 15:15:30;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr xx.xx.xx.xx:15001 on stream 1
01/25/2024 15:15:30;0002;pbs_mom;Svr;im_eof;Server closed connection.

server_log:
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;svr_save_db, Unable to save server data base Execution of Prepared statement insert_svr failed: no connection to the server
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;panic_stop_db, Panic shutdown of Server on database error.  Please check PBS_HOME file system for no space condition.
01/25/2024 15:15:26;0002;Server@xxxxxxx;Svr;Server@xxxxxxxr;Stopping PBS dataservice
01/25/2024 15:15:30;0002;Server@xxxxxxx;Svr;Log;Log closed
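
To compare the timing of the two logs, the entries can be split into their semicolon-separated fields. The sketch below is only an illustration: it assumes each line has six fields (timestamp, event code, daemon name, object type, object name, message), as in the excerpts above, and the function name parse_pbs_log_line is hypothetical, not part of PBS. On these excerpts it shows the MoM "Server closed connection" entry arriving about 4 seconds after the server's panic shutdown, which would be consistent with the MoM message being a consequence of the server stopping.

```python
from datetime import datetime


def parse_pbs_log_line(line):
    """Split one log line into six fields (layout assumed from the excerpts above)."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    stamp, code, daemon, obj_type, obj_name, message = line.rstrip("\n").split(";", 5)
    return {
        "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        "code": code,
        "daemon": daemon,
        "object_type": obj_type,
        "object_name": obj_name,
        "message": message,
    }


# Lines copied from the server_log and mom_log excerpts above.
server_line = (
    "01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;"
    "panic_stop_db, Panic shutdown of Server on database error.  "
    "Please check PBS_HOME file system for no space condition."
)
mom_line = "01/25/2024 15:15:30;0002;pbs_mom;Svr;im_eof;Server closed connection."

panic = parse_pbs_log_line(server_line)
eof = parse_pbs_log_line(mom_line)

# Seconds between the server's panic shutdown and MoM noticing the closed connection.
delay = (eof["time"] - panic["time"]).total_seconds()
print(f"MoM saw the connection close {delay:.0f} s after the server panic")
```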

I'm using a firewall. Would opening TCP ports 15001-15009 and 17001 prevent the state:down?

I plan to configure the firewall as follows (the ports line from "firewall-cmd --list-all"):
 ports: 15001-15009/tcp 17001/tcp
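
Whether these ports are actually reachable through the firewall can be checked from another host with a plain TCP connect test. This is a minimal sketch, not a PBS tool: the helper name check_ports and the host name in the usage comment are hypothetical, and a failed connect only means "blocked or nothing listening", so it cannot tell the two cases apart.

```python
import socket

# Ports taken from the firewall rule above: 15001-15009/tcp and 17001/tcp.
PBS_PORTS = list(range(15001, 15010)) + [17001]


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool} indicating whether a TCP connect to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example usage (replace the placeholder host with the real PBS server host):
#   for port, ok in check_ports("pbs-server.example.com", PBS_PORTS).items():
#       print(f"{port}/tcp: {'reachable' if ok else 'blocked or not listening'}")
```

Ports with no daemon listening on them will report as unreachable even when the firewall is open, so only the ports that the installed daemons actually use are expected to succeed.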

The PBS version is 19.2.6.

Answers

  • Jake Goldingay
    Altair Employee
    edited January 31

    Hi Takayoshi,

    Please see below the ports on which each daemon listens in PBS v19.2.6:

    Daemon Listening at Port  Port Number  Protocol     Type of Communication
    pbs_server                15001        TPP (TCP)    All communication to server
    pbs_mom                   15002        TPP (TCP)    All communication to MoM
    pbs_resmon                15003        TPP (TCP)    Scheduler-MoM resource requests (pbs_resmon listens on this port)
    pbs_sched                 15004        TPP (TCP)    All communication to scheduler
    pbs_datastore             15007        proprietary  PBS information storage and retrieval
    pbs_comm                  17001        TPP (TCP)    All communication to pbs_comm
  • Takayoshi Toyama
    Altair Community Member
    edited May 27

    Hi Jake, 

    Thanks for the reply.

    Unfortunately, even after I opened the firewall ports, the down state still occurred.

    However, when I changed only the PBS version to "2022.1.5" in the exact same environment, the down state no longer occurred.

    Therefore, I went through the bug lists in the release notes from "19.2.6" to "2022.1.5" and excerpted the entries that may be relevant below.

    PBS-20650 Address PP-822: Client not accepting response due to slow network hangs server
    PBS-23858 mom segfaults when doing a rerun with an epilogue script
    PBS-23873 Mom crashed when using release_nodes_on_stageout
    PBS-23907 Possible server crash in node_down_requeue
    PBS-24394 Buffer overflow corrupts heap in MoM child responsible for file staging
    PBS-26731 memory leak observed during stess/long running testing
    PBS-26977 address github tracker 1469: pbs_mom core dump in post_reply()
    PBS-23946 Race condition in receiving resources_used from sister MoM and sending to server
    PBS-24146 address github tracker 1469: pbs_mom core dump in post_reply()
    PBS-28846 Server deadlocks because requests issued by server to itself using the network

    In particular, the last one, "PBS-28846", closely matches my environment, so I think this bug fix is why it no longer goes down.

    Do you think any of the above bug fixes could explain why the down state no longer occurs?

  • Jake Goldingay
    Altair Employee
    edited May 28

    Hi Takayoshi,

    Without knowing the underlying cause of this issue, we are unable to confirm whether it was related to any of the listed bugs.

    I understand you now have a working PBS install with version 2022.1.5; however, if you run into this issue again, please log a case on our support system and we will do our best to help you.

    Here is the link in case you need it: https://support.altair.com/