Commit c64a157

Add enter_maintenance and exit_maintenance roles
These roles can be used to place hosts into and remove them from maintenance.

Co-Authored-By: Jack Hodgkiss <[email protected]>
1 parent ff22ac7 commit c64a157

7 files changed, +141 -0 lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ Roles:
 * [commands](roles/commands/README.md) for running arbitrary commands
 * [crush_rules](roles/crush_rules/README.md) for defining CRUSH rules
 * [ec_profiles](roles/ec_profiles/README.md) for defining EC profiles
+* [enter_maintenance](roles/enter_maintenance/README.md) for placing hosts into maintenance
+* [exit_maintenance](roles/exit_maintenance/README.md) for removing hosts from maintenance
 * [keys](roles/keys/README.md) for defining auth keys
 * [pools](roles/pools/README.md) for defining pools

roles/enter_maintenance/README.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# enter_maintenance
+
+This role places Ceph hosts into maintenance mode using `cephadm`.
+
+## Prerequisites
+
+This role should be executed on one host at a time. This can be achieved by
+adding `serial: 1` to a play.
+
+### Host prerequisites
+
+* The role assumes that target hosts are reachable over SSH by a user with passwordless sudo configured.
+* Either direct Internet access or a private registry hosting the desired Ceph image, accessible to all hosts, is required.
+
+### Inventory
+
+This role assumes the existence of the following groups:
+
+* `mons`
+
+with at least one host in it - see the `cephadm` role for more details.
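
As a minimal sketch of the serialisation requirement described in this README, a play along the following lines would run the role against one host per batch. The `osds` group name is an assumption made for illustration, and the fully qualified role name `stackhpc.cephadm.enter_maintenance` is inferred from the `stackhpc.cephadm` namespace used by the role's own tasks rather than stated in the README.

```yaml
---
# Hypothetical playbook: "osds" is an assumed inventory group.
- name: Place Ceph hosts into maintenance one at a time
  hosts: osds
  gather_facts: true  # the role's tasks read ansible_facts.nodename
  serial: 1           # one host per batch, as the README requires
  roles:
    - role: stackhpc.cephadm.enter_maintenance
```

With `serial: 1`, Ansible finishes the play for one host before starting the next, so `ansible_play_batch` contains a single host at any time.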
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+---
+- name: Check if host can enter maintenance mode
+  ansible.builtin.import_role:
+    name: stackhpc.cephadm.commands
+  vars:
+    cephadm_commands:
+      - "orch host ok-to-stop {{ ansible_facts.nodename }}"
+
+# Annoyingly, 'ceph orch host ok-to-stop' does not exit non-zero when
+# it is not OK to stop, so we need to check for specific messages.
+- name: Assert that it is safe to stop host
+  ansible.builtin.assert:
+    that:
+      # This one is seen for monitors
+      - "'It is NOT safe' not in cephadm_commands_result.results[0].stderr"
+      # This one is seen for OSDs
+      - "'unsafe to stop' not in cephadm_commands_result.results[0].stderr"
+    fail_msg: "{{ cephadm_commands_result.results[0].stderr }}"
+
+- name: Fail over Ceph manager
+  when: '"Cannot stop active Mgr daemon" in cephadm_commands_result.results[0].stderr'
+  block:
+    - name: Extract full name of active Ceph manager
+      ansible.builtin.set_fact:
+        active_ceph_mgr: "{{ cephadm_commands_result.results[0].stderr | split | last | replace(\"'\", '') }}"
+
+    - name: Ensure active manager has been switched to another node
+      ansible.builtin.import_role:
+        name: stackhpc.cephadm.commands
+      vars:
+        cephadm_commands:
+          - "mgr fail {{ active_ceph_mgr }}"
+
+- name: Ensure host is in maintenance mode
+  ansible.builtin.import_role:
+    name: stackhpc.cephadm.commands
+  vars:
+    cephadm_commands:
+      - "orch host maintenance enter {{ ansible_facts.nodename }}"
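
The manager fail-over step above takes the last whitespace-separated token of the `ok-to-stop` stderr and strips single quotes from it. The standalone play below is a sketch of that filter chain only. The sample stderr text is hypothetical: only the phrase "Cannot stop active Mgr daemon" comes from the tasks above, and the rest of the message, including the manager name `host1.abcdef`, is assumed for illustration.

```yaml
---
- name: Illustrate extracting the active manager name
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical stderr: the real cephadm wording may differ.
    example_stderr: >-
      ALERT: Cannot stop active Mgr daemon, please switch active Mgrs
      with 'ceph mgr fail host1.abcdef'
  tasks:
    - name: Show the last token with single quotes removed
      ansible.builtin.debug:
        msg: "{{ example_stderr | split | last | replace(\"'\", '') }}"
```

For this sample input the debug message is `host1.abcdef`. The approach depends on the manager name being the final token of the message, so it would need revisiting if the message format changed.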
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+---
+- name: Assert that execution is serialised
+  ansible.builtin.assert:
+    that:
+      - ansible_play_batch | length == 1
+    fail_msg: >-
+      Hosts must be placed into maintenance one at a time in order to first check
+      whether it is safe to stop them.
+
+- name: List hosts in maintenance
+  ansible.builtin.import_role:
+    name: stackhpc.cephadm.commands
+  vars:
+    cephadm_commands:
+      - "orch host ls --format json-pretty --host_status maintenance"
+
+# Entering maintenance fails if the host is already in maintenance.
+- name: Enter maintenance
+  ansible.builtin.include_tasks: enter.yml
+  when: ansible_facts.nodename not in cephadm_hosts_in_maintenance
+  vars:
+    cephadm_hosts_in_maintenance: >-
+      {{ cephadm_commands_result.results[0].stdout |
+         from_json |
+         map(attribute='hostname') |
+         list }}
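
The guard on the final task parses the JSON host listing, collects the `hostname` of every entry, and skips `enter.yml` when the current node is already listed. The play below is a sketch of that check in isolation. The JSON sample and the `example_nodename` variable are hypothetical; only the `hostname` and `status` keys are taken from the roles in this commit.

```yaml
---
- name: Illustrate the already-in-maintenance check
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical listing: real output carries more keys per host.
    example_stdout: |
      [
        {"hostname": "ceph-01", "status": "Maintenance"},
        {"hostname": "ceph-02", "status": "Maintenance"}
      ]
    example_nodename: ceph-03  # stands in for ansible_facts.nodename
  tasks:
    - name: Build the list of hostnames in maintenance
      ansible.builtin.set_fact:
        hosts_in_maintenance: "{{ example_stdout | from_json | map(attribute='hostname') | list }}"

    - name: Report whether enter.yml would be included
      ansible.builtin.debug:
        msg: "{{ example_nodename not in hosts_in_maintenance }}"
```

Here `ceph-03` is absent from the listing, so the condition is true and the host would proceed to the maintenance tasks.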

roles/exit_maintenance/README.md

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# exit_maintenance
+
+This role removes Ceph hosts from maintenance mode using `cephadm`.
+
+## Prerequisites
+
+This role should be executed on one host at a time. This can be achieved by
+adding `serial: 1` to a play.
+
+### Host prerequisites
+
+* The role assumes that target hosts are reachable over SSH by a user with passwordless sudo configured.
+* Either direct Internet access or a private registry hosting the desired Ceph image, accessible to all hosts, is required.
+
+### Inventory
+
+This role assumes the existence of the following groups:
+
+* `mons`
+
+with at least one host in it - see the `cephadm` role for more details.
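
One way the two maintenance roles might be combined is a rolling play in which each host enters maintenance, is worked on, and exits before the next host starts. This is a sketch under assumptions: the `osds` group and the placeholder debug task are hypothetical, and the role names are inferred from the `stackhpc.cephadm` namespace used in the task files.

```yaml
---
# Hypothetical rolling-maintenance play: "osds" is an assumed group and
# the debug task stands in for real maintenance work.
- name: Rolling maintenance of Ceph hosts
  hosts: osds
  gather_facts: true
  serial: 1  # both roles assert a batch size of one
  tasks:
    - name: Enter maintenance
      ansible.builtin.include_role:
        name: stackhpc.cephadm.enter_maintenance

    - name: Placeholder for maintenance work
      ansible.builtin.debug:
        msg: "Maintenance on {{ ansible_facts.nodename }} would happen here"

    - name: Exit maintenance
      ansible.builtin.include_role:
        name: stackhpc.cephadm.exit_maintenance
```

Because `serial: 1` applies to the whole play, all three tasks complete for one host before the next host begins.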
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+---
+- name: Ensure host has exited maintenance mode
+  ansible.builtin.import_role:
+    name: stackhpc.cephadm.commands
+  vars:
+    cephadm_commands:
+      - "orch host maintenance exit {{ ansible_facts.nodename }}"
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+---
+- name: Assert that execution is serialised
+  ansible.builtin.assert:
+    that:
+      - ansible_play_batch | length == 1
+    fail_msg: >-
+      Hosts must be removed from maintenance one at a time.
+
+- name: List hosts
+  ansible.builtin.import_role:
+    name: stackhpc.cephadm.commands
+  vars:
+    cephadm_commands:
+      - "orch host ls --format json-pretty"
+
+# Exiting maintenance fails unless the host is in maintenance or offline.
+- name: Exit maintenance
+  ansible.builtin.include_tasks: exit.yml
+  when: cephadm_host_status.status | lower in ["maintenance", "offline"]
+  vars:
+    cephadm_host_status: >-
+      {{ cephadm_commands_result.results[0].stdout |
+         from_json |
+         selectattr('hostname', 'equalto', ansible_facts.nodename) |
+         first }}
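
The exit guard selects the current host's entry from the full listing and compares its lower-cased `status` against the two accepted values. The play below sketches that lookup on its own. The JSON sample and `example_nodename` are hypothetical, and the empty status string for a healthy host is an assumption.

```yaml
---
- name: Illustrate the exit-maintenance status check
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical listing: real output carries more keys per host.
    example_stdout: |
      [
        {"hostname": "ceph-01", "status": ""},
        {"hostname": "ceph-02", "status": "Maintenance"}
      ]
    example_nodename: ceph-02  # stands in for ansible_facts.nodename
  tasks:
    - name: Select the entry for this host
      ansible.builtin.set_fact:
        host_entry: "{{ example_stdout | from_json | selectattr('hostname', 'equalto', example_nodename) | first }}"

    - name: Report whether exit.yml would be included
      ansible.builtin.debug:
        msg: "{{ host_entry.status | lower in ['maintenance', 'offline'] }}"
```

For `ceph-02` the condition is true. Note that `first` has nothing to return if no entry matches the node name, so the lookup relies on the host being present in the listing.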
