PBS Pro State down
I ran PBS Pro for the first time in a long while, and the node ended up in state:down. (For now, restarting PBS brings it back to state:free and it is usable again.)
Because the following log entries were present, I suspect a problem with the communication between the server and MoM.
mom_log:
01/25/2024 15:15:30;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr xx.xx.xx.xx:15001 on stream 1
01/25/2024 15:15:30;0002;pbs_mom;Svr;im_eof;Server closed connection.
server_log:
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;svr_save_db, Unable to save server data base Execution of Prepared statement insert_svr failed: no connection to the server
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;panic_stop_db, Panic shutdown of Server on database error. Please check PBS_HOME file system for no space condition.
01/25/2024 15:15:26;0002;Server@xxxxxxx;Svr;Server@xxxxxxxr;Stopping PBS dataservice
01/25/2024 15:15:30;0002;Server@xxxxxxx;Svr;Log;Log closed
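When comparing the MoM and server logs, it can help to split each entry into its fields so the timestamps line up (here the server's panic shutdown at 15:15:26 comes four seconds before MoM reports the closed connection at 15:15:30). The following is only a minimal sketch: it assumes the semicolon-separated layout seen in the lines above (timestamp, event code, daemon, object type, object name, message), and the names `PbsLogEntry` and `parse_pbs_log_line` are hypothetical helpers, not part of PBS.

```python
from datetime import datetime
from typing import NamedTuple


class PbsLogEntry(NamedTuple):
    """One PBS log line split into fields (hypothetical helper type)."""
    timestamp: datetime
    event_code: int
    daemon: str
    object_type: str
    object_name: str
    message: str


def parse_pbs_log_line(line: str) -> PbsLogEntry:
    """Split a log line on the first five semicolons.

    Assumes the six-field layout seen in the excerpts above; the message
    itself may contain further semicolons, so it is kept whole.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected log line format: {line!r}")
    stamp, code, daemon, obj_type, obj_name, message = parts
    return PbsLogEntry(
        timestamp=datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        event_code=int(code, 16),  # the code is written as hex digits
        daemon=daemon,
        object_type=obj_type,
        object_name=obj_name,
        message=message,
    )
```

Sorting the parsed entries from both logs by `timestamp` gives a single timeline of what happened first.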
I'm using a firewall. Would opening TCP ports 15001-15009 and 17001 prevent state:down?
I plan to configure the firewall as follows (the ports part of "firewall-cmd --list-all"):
ports: 15001-15009/tcp 17001/tcp
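To double-check that a "ports:" line like the one above really covers every port that is needed, the value can be expanded and compared against a required list. This is a rough sketch rather than an official tool: `parse_firewalld_ports` and `missing_ports` are hypothetical names, and the input is assumed to be the space-separated `port/proto` or `low-high/proto` tokens shown above.

```python
from typing import Iterable, List, Set, Tuple


def parse_firewalld_ports(spec: str) -> Set[Tuple[int, str]]:
    """Expand a string such as '15001-15009/tcp 17001/tcp' into (port, proto) pairs."""
    covered: Set[Tuple[int, str]] = set()
    for token in spec.split():
        port_part, _, proto = token.partition("/")
        low, _, high = port_part.partition("-")
        first = int(low)
        last = int(high) if high else first  # a single port has no range end
        for port in range(first, last + 1):
            covered.add((port, proto))
    return covered


def missing_ports(spec: str, required: Iterable[int], proto: str = "tcp") -> List[int]:
    """Return the required ports that the firewall spec does not open for proto."""
    covered = parse_firewalld_ports(spec)
    return sorted(p for p in required if (p, proto) not in covered)


# Ports named in the question: 15001-15009 and 17001, all TCP.
REQUIRED = list(range(15001, 15010)) + [17001]
print(missing_ports("15001-15009/tcp 17001/tcp", REQUIRED))
```

An empty list means the firewall setting opens everything in `REQUIRED`; it says nothing about whether the daemons are actually listening.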
The PBS version is 19.2.6.
Answers
Hi Takayoshi,
Please see below the ports on which each daemon listens in PBS v19.2.6:
Daemon Listening at Port  Port Number  Protocol     Type of Communication
pbs_server                15001        TPP (TCP)    All communication to server
pbs_mom                   15002        TPP (TCP)    All communication to MoM
pbs_resmon                15003        TPP (TCP)    Scheduler-MoM resource requests (pbs_resmon listens on this port)
pbs_sched                 15004        TPP (TCP)    All communication to scheduler
pbs_datastore             15007        proprietary  PBS information storage and retrieval
pbs_comm                  17001        TPP (TCP)    All communication to pbs_comm
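One way to confirm that the ports listed above are reachable through the firewall is to attempt a TCP connection to each of them from the other host. The sketch below is an illustration only: `tcp_port_open` and `check_pbs_ports` are hypothetical helper names, and the host name in the commented usage is a placeholder. Not every daemon runs on every host, and a daemon may be bound to a local address only, so a closed port is not necessarily a fault.

```python
import socket
from typing import Dict, Mapping

# Port-to-daemon mapping taken from the list above.
PBS_PORTS = {
    15001: "pbs_server",
    15002: "pbs_mom",
    15003: "pbs_resmon",
    15004: "pbs_sched",
    15007: "pbs_datastore",
    17001: "pbs_comm",
}


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_pbs_ports(host: str,
                    ports: Mapping[int, str] = PBS_PORTS,
                    timeout: float = 2.0) -> Dict[int, bool]:
    """Probe each port on host and report which ones accepted a connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example usage (placeholder host name, run from the MoM host):
# for port, is_open in check_pbs_ports("pbs-server.example").items():
#     print(PBS_PORTS[port], port, "open" if is_open else "closed or filtered")
```

A successful connection only shows that something is listening and the firewall lets it through; it does not verify that the PBS daemons can authenticate to each other.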
Hi Jake,
Thanks for the reply.
Unfortunately, I opened the firewall ports, but the down state still occurred.
However, when I changed only the PBS version to "2022.1.5" in the exact same environment, the down state no longer occurred.
Therefore, I went through the bug lists in the release notes from "19.2.6" to "2022.1.5" and excerpted the entries that may be relevant below.
PBS-20650 Address PP-822: Client not accepting response due to slow network hangs server
PBS-23858 mom segfaults when doing a rerun with an epilogue script
PBS-23873 Mom crashed when using release_nodes_on_stageout
PBS-23907 Possible server crash in node_down_requeue
PBS-24394 Buffer overflow corrupts heap in MoM child responsible for file staging
PBS-26731 memory leak observed during stess/long running testing
PBS-26977 address github tracker 1469: pbs_mom core dump in post_reply()
PBS-23946 Race condition in receiving resources_used from sister MoM and sending to server
PBS-24146 address github tracker 1469: pbs_mom core dump in post_reply()
PBS-28846 Server deadlocks because requests issued by server to itself using the network
In particular, the last one, "PBS-28846", is close to my environment, so I think this bug fix is the reason it no longer goes down.
Can you confirm whether any of the above bug fixes explains why the down state no longer occurs?
Hi Takayoshi,
Without knowing the underlying cause of this issue, we are unable to confirm whether it was related to any of the listed bugs.
I understand you now have a working PBS installation with version 2022.1.5. However, if you run into this issue again, please log a case on our support system and we will do our best to help you.
Here is the link in case you need it: https://support.altair.com/