Commit 59e1222

Stop slurmd during slurmctld restart
Some Slurm configuration changes can leave compute nodes in an invalid state if slurmctld is restarted while slurmd services are still running. Stop the slurmd services while slurmctld is being restarted. Testing showed that this does not affect running jobs. Closes #199
1 parent 34d3996 commit 59e1222

1 file changed: +19 -0 lines changed

tasks/runtime.yml

Lines changed: 19 additions & 0 deletions
@@ -127,6 +127,25 @@
     - "_openhpc_slurmdbd_state.stdout == 'inactive'"
     - openhpc_enable.database | default(false)

+- name: Stop slurmd if configuration has changed
+  service:
+    name: "slurmd"
+    state: stopped
+  retries: 5
+  register: slurmd_stop
+  until: slurmd_stop is success
+  delay: 30
+  when:
+    - openhpc_slurm_service_started | bool
+    - openhpc_enable.batch | default(false) | bool
+    - openhpc_slurm_control_host in ansible_play_hosts
+    - hostvars[openhpc_slurm_control_host].ohpc_slurm_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_cgroup_conf.changed or
+      hostvars[openhpc_slurm_control_host].ohpc_gres_conf.changed # noqa no-handler
+
+- name: Flush handler
+  meta: flush_handlers # This will restart slurmctld while slurmd services are stopped, if needed
+
 - name: Notify handler for slurmd restart
   debug:
     msg: "notifying handlers" # meta: noop doesn't support 'when'
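
The ordering this change relies on (stop slurmd, flush handlers so slurmctld restarts, then bring slurmd back) can be illustrated with a stripped-down playbook. This is only a sketch under stated assumptions, not the role's actual code: the host group `cluster`, the template name `slurm.conf.j2`, the destination path, the registered variable `slurm_conf`, and the handler name `Restart slurmctld` are all hypothetical, and the split between control and compute hosts that the real task handles through `hostvars[openhpc_slurm_control_host]` is ignored here.

```yaml
# Hypothetical, simplified playbook: names and paths are illustrative assumptions.
- hosts: cluster
  become: true
  tasks:
    - name: Template slurm.conf (queues the controller restart handler on change)
      ansible.builtin.template:
        src: slurm.conf.j2
        dest: /etc/slurm/slurm.conf
      register: slurm_conf
      notify: Restart slurmctld

    - name: Stop slurmd before the controller restarts
      ansible.builtin.service:
        name: slurmd
        state: stopped
      when: slurm_conf is changed

    # Without this, the queued handler would only run at the end of the play,
    # after slurmd had already been started again.
    - name: Run pending handlers now
      ansible.builtin.meta: flush_handlers

    - name: Start slurmd once the controller restart has happened
      ansible.builtin.service:
        name: slurmd
        state: started

  handlers:
    - name: Restart slurmctld
      ansible.builtin.service:
        name: slurmctld
        state: restarted
```

The key design point is that `meta: flush_handlers` forces any notified handler to run at that position in the task list, so the controller restart falls inside the window in which slurmd is stopped.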
