Commit d6c8b44

Auto-restart slurmctld on failure after 1 second.
Signed-off-by: Giacomo Marciani <[email protected]>
1 parent e3b2269 commit d6c8b44

3 files changed, with 5 additions and 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - xdcv: `2024.0.631-1`
 - gl: `2024.0.1078-1`
 - web_viewer: `2024.0-18131-1`
+- Auto-restart slurmctld on failure.

 **BUG FIXES**
 - Fix an issue in the way we get region when manage volumes so that it can correctly handle local zone.

cookbooks/aws-parallelcluster-slurm/spec/unit/recipes/config_slurmctld_systemd_service_spec.rb

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@
       it 'creates the service definition for slurmctld with the correct settings' do
         is_expected.to render_file('/etc/systemd/system/slurmctld.service')
           .with_content("After=network-online.target munge.service remote-fs.target")
+          .with_content("Restart=on-failure")
+          .with_content("RestartSec=1s")
       end
     end
   end

cookbooks/aws-parallelcluster-slurm/templates/default/slurm/head_node/slurmctld.service.erb

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ ExecReload=/bin/kill -HUP $MAINPID
 LimitNOFILE=562930
 LimitMEMLOCK=infinity
 LimitSTACK=infinity
+Restart=on-failure
+RestartSec=1s

 [Install]
 WantedBy=multi-user.target
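
Note on the two added directives: in systemd, `Restart=on-failure` restarts the service when it exits with a non-zero status, is killed by a signal, or hits a timeout, but not after a clean exit. `RestartSec=1s` sets the delay before that restart. The following is a minimal Ruby sketch, not part of this commit, of how the rendered unit content could be checked outside ChefSpec. The helper name `service_settings` and the sample unit text are assumptions made for illustration; the sample only mirrors the lines visible in the hunk above.

```ruby
# Hypothetical helper (not part of the cookbook): collect the directives of the
# [Service] section of a systemd unit file into a Hash of name => value.
def service_settings(unit_text)
  settings = {}
  section = nil
  unit_text.each_line do |raw|
    line = raw.strip
    # Skip blank lines and systemd comment lines.
    next if line.empty? || line.start_with?('#', ';')

    if (header = line.match(/\A\[(.+)\]\z/))
      section = header[1]
    elsif section == 'Service' && line.include?('=')
      key, value = line.split('=', 2)
      settings[key.strip] = value.strip
    end
  end
  settings
end

# Sample content limited to the lines shown in the diff; a real rendered unit has more.
sample_unit = <<~UNIT
  [Service]
  LimitNOFILE=562930
  LimitMEMLOCK=infinity
  LimitSTACK=infinity
  Restart=on-failure
  RestartSec=1s

  [Install]
  WantedBy=multi-user.target
UNIT

settings = service_settings(sample_unit)
puts "Restart=#{settings['Restart']} RestartSec=#{settings['RestartSec']}"
```

On a running head node, `systemctl show slurmctld -p Restart` reports the restart policy systemd actually loaded for the unit, which is the more direct check.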
