I had an issue with slurmctld dying on PC 3.11.1. It looks like the cause is at my end, a problem with DNS and reverse lookup (#6529), but while reviewing how to mitigate it I modified the slurmctld systemd unit file to restart slurmctld on failure, so I don't lose my cluster.
Here is my new 3.11.1 slurmctld.service file:
# /etc/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network-online.target munge.service remote-fs.target
Wants=network-online.target
ConditionPathExists=/opt/slurm/etc/slurm.conf
StartLimitIntervalSec=30
StartLimitBurst=2
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/opt/slurm/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=562930
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
RestartSec=10s
[Install]
WantedBy=multi-user.target
Four new lines added:
In the [Unit] section:
StartLimitIntervalSec=30
StartLimitBurst=2
and in the [Service] section:
Restart=on-failure
RestartSec=10s
Maybe this is something you might want to consider adding to the standard distribution?
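For anyone who wants to try this without editing the shipped unit file, the same four settings could probably go in a systemd drop-in instead. This is only a sketch: the drop-in file name restart.conf is my own choice (hypothetical), and I have not checked whether a cluster update would keep or overwrite it.

```ini
# /etc/systemd/system/slurmctld.service.d/restart.conf
# (hypothetical file name; any *.conf in the slurmctld.service.d directory is read)

[Unit]
# Allow at most 2 start attempts within any 30-second window.
StartLimitIntervalSec=30
StartLimitBurst=2

[Service]
# Restart only on unclean exits (non-zero exit code, signal, timeout),
# waiting 10 seconds before each restart attempt.
Restart=on-failure
RestartSec=10s
```

After creating the file, run systemctl daemon-reload so systemd picks it up; systemctl show slurmctld -p Restart -p StartLimitBurst should then report the new values. Note that with these numbers, if slurmctld crashes immediately on every start, systemd gives up after the second start and leaves the unit in the failed state, so it needs systemctl reset-failed slurmctld (or a manual start) to bring it back.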