PBS Pro State down
I ran PBS Pro for the first time in a long time, and it ended up in state: down. (Currently, after restarting it, it goes to state: free and can be used.)
Because of the following log entries, I suspect a problem with the communication between the server and MoM.
mom_log:
01/25/2024 15:15:30;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr xx.xx.xx.xx:15001 on stream 1
01/25/2024 15:15:30;0002;pbs_mom;Svr;im_eof;Server closed connection.
server_log:
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;svr_save_db, Unable to save server data base Execution of Prepared statement insert_svr failed: no connection to the server
01/25/2024 15:15:26;0001;Server@xxxxxxx;Svr;Server@xxxxxxx;panic_stop_db, Panic shutdown of Server on database error. Please check PBS_HOME file system for no space condition.
01/25/2024 15:15:26;0002;Server@xxxxxxx;Svr;Server@xxxxxxxr;Stopping PBS dataservice
01/25/2024 15:15:30;0002;Server@xxxxxxx;Svr;Log;Log closed
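When scanning longer logs for entries like the ones above, a small script can help. The following is a minimal sketch, assuming Python 3 is available. It relies on the six semicolon-separated fields visible in the quoted lines; the field names, the function names, and the keyword list are hypothetical labels chosen for illustration, with the keywords taken only from the messages quoted in this post.

```python
from typing import Iterable, List, NamedTuple, Optional


class PbsLogRecord(NamedTuple):
    """One log line split into its six semicolon-separated fields."""
    timestamp: str
    event_code: str
    server_name: str
    object_type: str
    object_name: str
    message: str


def parse_pbs_log_line(line: str) -> Optional[PbsLogRecord]:
    """Return a record for a well-formed line, or None otherwise."""
    # maxsplit=5 keeps any ';' that appears inside the message text.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return PbsLogRecord(*parts)


# Hypothetical keyword list, taken from the messages quoted in this post.
SUSPICIOUS = (
    "panic_stop_db",
    "svr_save_db",
    "im_eof",
    "no connection to the server",
)


def find_suspicious(lines: Iterable[str]) -> List[PbsLogRecord]:
    """Collect records whose object name or message contains a keyword."""
    hits = []
    for line in lines:
        rec = parse_pbs_log_line(line)
        if rec is None:
            continue
        text = rec.object_name + " " + rec.message
        if any(keyword in text for keyword in SUSPICIOUS):
            hits.append(rec)
    return hits
```

To use it, pass an open log file to `find_suspicious()` and print the `timestamp` and `message` of each returned record. The location of the log files depends on the installation.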
I'm using a firewall. Would opening TCP ports 15001-15009 and 17001 prevent state: down?
I plan to configure the firewall as follows (the ports line from "firewall-cmd --list-all"):
ports: 15001-15009/tcp 17001/tcp
The PBS version is 19.2.6.
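One way to see whether those ports are actually reachable from another host is a plain TCP connection test. This is a minimal sketch, assuming Python 3 is available; `tcp_port_open`, `check_ports`, and `PBS_PORTS` are hypothetical names, and the port list simply mirrors the firewall rule quoted above.

```python
import socket
from typing import Dict, Iterable


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Ports from the firewall rule above: 15001-15009/tcp and 17001/tcp.
PBS_PORTS = list(range(15001, 15010)) + [17001]
```

For example, `check_ports("server-hostname", PBS_PORTS)` run from the MoM host returns a dictionary of port to True/False, where "server-hostname" is a placeholder. A False result only means the connection failed; it cannot distinguish a port blocked by the firewall from a port with no daemon listening on it.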
Hi Jake,
Thanks for the reply.
Unfortunately, even after I opened the firewall ports, the state: down still occurred.
However, when I changed only the PBS version to "2022.1.5" in the exact same environment, it no longer occurs.
So I went through the bug lists in the release notes from "19.2.6" to "2022.1.5" and excerpted the entries that may be relevant below.
PBS-20650 | Address PP-822: Client not accepting response due to slow network hangs server |
PBS-23858 | mom segfaults when doing a rerun with an epilogue script |
PBS-23873 | Mom crashed when using release_nodes_on_stageout |
PBS-23907 | Possible server crash in node_down_requeue |
PBS-24394 | Buffer overflow corrupts heap in MoM child responsible for file staging |
PBS-26731 | memory leak observed during stess/long running testing |
PBS-26977 | address github tracker 1469: pbs_mom core dump in post_reply() |
PBS-23946 | Race condition in receiving resources_used from sister MoM and sending to server |
PBS-24146 | address github tracker 1469: pbs_mom core dump in post_reply() |
PBS-28846 | Server deadlocks because requests issued by server to itself using the network |
In particular, the last one, "PBS-28846", is close to my environment, so I think this bug fix is the reason it no longer goes down.
Do you think any of the above bug fixes could be why the down state no longer occurs?
Hi Takayoshi,
Without knowing the underlying cause of this issue, we are unable to confirm whether it was related to any of the listed bugs.
I understand you now have a working PBS install with version 2022.1.5. However, if you run into this issue again, please log a case on our support system and we will do our best to help you.
Here is the link in case you need it: https://support.altair.com/
Hi Takayoshi,
Please see below the ports that each daemon listens on in PBS v19.2.6:
(pbs_resmon listens on this port)