Commit 02c9683

Merge branch 'develop' into efa
2 parents f1d02a0 + 4e23e3f commit 02c9683

3 files changed: +5 -0 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Libfabric-aws: `libfabric-aws-1.22.0-1`
 - Rdma-core: `rdma-core-54.0-1`
 - Open MPI: `openmpi40-aws-4.1.7-1` and `openmpi50-aws-5.0.5`
+- Auto-restart slurmctld on failure.
 
 **BUG FIXES**
 - Fix an issue in the way we get region when manage volumes so that it can correctly handle local zone.

cookbooks/aws-parallelcluster-slurm/spec/unit/recipes/config_slurmctld_systemd_service_spec.rb

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,8 @@
     it 'creates the service definition for slurmctld with the correct settings' do
       is_expected.to render_file('/etc/systemd/system/slurmctld.service')
         .with_content("After=network-online.target munge.service remote-fs.target")
+        .with_content("Restart=on-failure")
+        .with_content("RestartSec=1s")
     end
   end
 end
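
The `with_content` expectations in this spec match on the text of the rendered file. As a rough sketch only (not part of this commit), a stricter check could parse the unit text and confirm that both directives sit under `[Service]`. The `parse_unit` helper and the `SAMPLE_UNIT` text are hypothetical and exist only for illustration; only the two `Restart` lines, the `LimitSTACK` line, and the `[Install]` section are taken from the diff.

```ruby
# Hypothetical helper (not in the cookbook): parse systemd unit text into
# a nested hash of the form { "Section" => { "Key" => "Value" } }.
# Simplification: a repeated key keeps only its last value.
def parse_unit(text)
  sections = Hash.new { |hash, name| hash[name] = {} }
  current = nil
  text.each_line do |line|
    line = line.strip
    # Skip blank lines and comment lines.
    next if line.empty? || line.start_with?('#', ';')

    if (match = line.match(/\A\[(.+)\]\z/))
      current = match[1]
    elsif current && line.include?('=')
      key, value = line.split('=', 2)
      sections[current][key.strip] = value.strip
    end
  end
  sections
end

# Assumed sample text standing in for the rendered template.
SAMPLE_UNIT = <<~UNIT
  [Service]
  LimitSTACK=infinity
  Restart=on-failure
  RestartSec=1s

  [Install]
  WantedBy=multi-user.target
UNIT

service = parse_unit(SAMPLE_UNIT)['Service']
puts service['Restart']
puts service['RestartSec']
```

A section-aware check like this would fail if the directives were moved under `[Install]` by mistake, which a plain text match would not notice.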

cookbooks/aws-parallelcluster-slurm/templates/default/slurm/head_node/slurmctld.service.erb

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ ExecReload=/bin/kill -HUP $MAINPID
 LimitNOFILE=562930
 LimitMEMLOCK=infinity
 LimitSTACK=infinity
+Restart=on-failure
+RestartSec=1s
 
 [Install]
 WantedBy=multi-user.target
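
For readers unfamiliar with the two added directives, the following is a simplified model of when `Restart=on-failure` triggers a restart, based on the behavior described in the systemd.service documentation. It is an illustrative sketch, not systemd's implementation: the `restart_on_failure?` function and its parameters are hypothetical, and start rate limiting and `SuccessExitStatus=` overrides are not modeled.

```ruby
# Signals that systemd treats as a clean exit for non-oneshot services.
CLEAN_SIGNALS = %w[HUP INT TERM PIPE].freeze

# Hypothetical model of Restart=on-failure: returns true when the service
# would be restarted (after the RestartSec delay, 1s in this commit).
def restart_on_failure?(exit_status: nil, signal: nil, timeout: false, watchdog: false)
  # Operation timeouts and watchdog timeouts count as failures.
  return true if timeout || watchdog
  # Termination by a signal outside the clean set counts as a failure.
  return !CLEAN_SIGNALS.include?(signal) if signal

  # Otherwise only a non-zero exit code counts as a failure.
  !exit_status.nil? && exit_status != 0
end

puts restart_on_failure?(exit_status: 1)   # non-zero exit code
puts restart_on_failure?(exit_status: 0)   # clean exit
puts restart_on_failure?(signal: 'KILL')   # unclean signal
puts restart_on_failure?(signal: 'TERM')   # clean signal
```

Under this model a crash of slurmctld (non-zero exit or an unclean signal) leads to a restart, while a deliberate stop does not.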
