Commit aff28f7

change NHC to use templating instead of autoconfiguration
1 parent bf632f1 commit aff28f7

6 files changed, with 49 additions and 99 deletions.

ansible/roles/nhc/README.md

Lines changed: 25 additions & 62 deletions
@@ -21,83 +21,46 @@ To enable node health checks, ensure the `nhc` group contains the `compute` grou
2121
compute
2222
```
2323

24-
This will automatically:
24+
When the `ansible/site.yml` playbook is run, this will automatically:
2525
1. Add NHC-related configuration to the `slurm.conf` Slurm configuration file.
26-
The default configuration is defined in `openhpc_config_nhc`
27-
(see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)).
28-
It will run healthchecks on all `IDLE` nodes which are not `DRAINED` or `NOT_RESPONDING`
29-
every 300 seconds. See [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html)
30-
`HealthCheckInterval`, `HealthCheckNodeState`, `HealthCheckProgram`. These may
31-
be overridden if required by redefining `openhpc_config_nhc` in e.g.
32-
`environments/site/inventory/group_vars/nhc/yml`.
33-
34-
2. Define a default configuration for health checks for each compute node
35-
individually using [nhc-genconf](https://github.com/mej/nhc?tab=readme-ov-file#config-file-auto-generation)
36-
The generated checks include:
26+
The default configuration is defined in `openhpc_config_nhc`
27+
(see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)).
28+
It will run healthchecks on all `IDLE` nodes which are not `DRAINED` or
29+
`NOT_RESPONDING` every 300 seconds. See [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html)
30+
`HealthCheckInterval`, `HealthCheckNodeState`, `HealthCheckProgram`. These
31+
may be overridden if required by redefining `openhpc_config_nhc` in e.g.
32+
`environments/site/inventory/group_vars/nhc/yml`.
33+
34+
2. Template out node health check rules using Ansible facts for each compute
35+
node. Currently these check:
3736
- Filesystem mounts
38-
- Filesystem space
39-
- CPU info
40-
- Memory and swap
41-
- Network interfaces
42-
- Various processes
37+
- Ethernet interfaces
4338

4439
See `/etc/nhc/nhc.conf` on a compute node for the full configuration.
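
The override of the Slurm health check parameters mentioned in step 1 might look like the sketch below. It assumes `openhpc_config_nhc` is a mapping of `slurm.conf` parameter names to values, and that NHC is installed at `/usr/sbin/nhc`; neither is confirmed here, so check the linked `openhpc.yml` for the actual structure and defaults first. The 600 second interval is purely illustrative.

```yaml
# Hypothetical override, e.g. in a group_vars file for the `nhc` group.
# Assumes openhpc_config_nhc is a mapping of slurm.conf parameters.
# Redefining the variable replaces the whole default, so every key is restated.
openhpc_config_nhc:
  HealthCheckProgram: /usr/sbin/nhc      # assumed NHC install path
  HealthCheckNodeState: NONDRAINED_IDLE  # idle nodes that are not drained or unresponsive
  HealthCheckInterval: 600               # seconds; illustrative, the default described above is 300
```

Restating all three keys avoids accidentally dropping a parameter if the variable is not merged with the default.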
4540

46-
The automatically generated checks may be modified or disabled using the
47-
`nhc_replacements` role variable described below.
48-
4941
If a node healthcheck run fails, Slurm will mark the node `DOWN`. With the
5042
default [alerting configuration](../../../docs/alerting.md) this will trigger
5143
an alert.
5244

53-
## Updating Health Checks
54-
55-
The above approach assumes that when the `site.yml` playbook is run all nodes
56-
are functioning correctly. Therefore if changes are made to aspects covered by
57-
the healthchecks (see above) without re-running this playbook, use the following
58-
to update the autogenerated health checks:
59-
60-
```shell
61-
ansible-playbook ansible/extras.yml --tags nhc
62-
```
63-
6445
## Role Variables
6546

66-
- `nhc_config_changes`: Optional, default empty list. A list of mappings
67-
defining replacements in the autogenerated health checks. Mappings may have
68-
any keys from [ansible.builtin.lineinfile](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html).
69-
Note that the NHC [configuration line format](https://github.com/mej/nhc?tab=readme-ov-file#configuration-file-syntax) is:
70-
71-
TARGET || CHECK
72-
73-
where for autogenerated checks `TARGET` is the hostname. So a regex like:
74-
75-
'^(\s+\S+\s+)\|\|(\s+.*)$'
76-
77-
captures the `TARGET` and `||` separator as `\1` and the actual check as `\2`.
78-
For example the following configuration will comment-out checks for a
79-
specific interface for all nodes - in reality using `state: absent` would be
80-
better if the check is not required, but this shows the regex syntax:
81-
82-
```yaml
83-
nhc_config_changes:
84-
- regexp: '^(\s+\S+\s+\|\|\s+)(check_hw_eth eth0)$'
85-
line: '#\1\2'
86-
backrefs: yes
87-
```
47+
- `nhc_config_template`: Template to use. Default is the in-role template
48+
providing rules described above.
49+
- `nhc_config_extra`: Possibly multiline string defining [additional rules](https://github.com/mej/nhc/blob/master/README.md) to
50+
add. Jinja templating may be used. Default is empty string.
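
As an illustration, extra rules could be supplied as a YAML block scalar. `check_fs_free` and `check_ps_service` are standard NHC checks, but the mount point, the threshold, the choice of `sshd` and the placement in a group_vars file are assumptions made for this example only.

```yaml
# Hypothetical site configuration, e.g. in a group_vars file for the `nhc` group.
# The literal block scalar (|) keeps one NHC rule per line.
nhc_config_extra: |
  # Require at least 3% free space on /tmp (illustrative threshold)
  {{ ansible_fqdn }} || check_fs_free /tmp 3%
  # Require an sshd process owned by root
  {{ ansible_fqdn }} || check_ps_service -u root sshd
```

Using `{{ ansible_fqdn }}` as the target follows the `TARGET || CHECK` form of the in-role rules, so each rendered rule applies only to the node it was templated for.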
8851

8952
## Structure
9053

9154
This role contains 3x task files, which run at different times:
92-
- `main.yml`: Runs from `site.yml` -> `slurm.yml`. Generates health check
93-
configuration.
94-
- `export.yml`: Runs from `site.yml` -> `extras.yml` via role `compute_init`
95-
tasks `export.yml`. Copies the generated health check configuration to the
96-
control node NFS share for compute-init.
97-
- `import.yml`: Runs on boot via `compute_init/files/compute-init.yml` and
98-
copies the node's generated health check configuration from the control node
99-
NFS share to local disk.
55+
- `main.yml`: Runs from `site.yml` -> `slurm.yml`. Templates health check
56+
configuration to nodes.
57+
- `export.yml`: Runs from `site.yml` -> `final.yml` via role `compute_init`
58+
tasks `export.yml`. Templates health check configuration to the cluster NFS
59+
share for compute-init.
60+
- `boot.yml`: Runs on boot via `compute_init/files/compute-init.yml`. Copies
61+
the node's generated health check configuration from the cluster share to
62+
local disk.
10063

10164
Note that the `stackhpc.openhpc` role:
10265
- Installs the required package
103-
- Configures slurm
66+
- Configures slurm.conf parameters
Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
-nhc_config_changes: []
+nhc_config_template: nhc.conf.j2
+nhc_config_extra: ''
File renamed without changes.

ansible/roles/nhc/tasks/export.yml

Lines changed: 4 additions & 8 deletions
@@ -1,10 +1,6 @@
-- name: Slurp generated NHC configuration
-  ansible.builtin.slurp:
-    src: /etc/nhc/nhc.conf
-  register: _nhc_generated_conf
-
-- name: Write generated NHC configuration to control node exports
-  ansible.builtin.copy:
+# Used for compute-init
+- name: Template out host specific NHC config
+  ansible.builtin.template:
+    src: "{{ nhc_config_template }}"
     dest: "/exports/cluster/hostconfig/{{ inventory_hostname }}/nhc.conf"
-    content: "{{ _nhc_generated_conf.content | b64decode }}"
   delegate_to: "{{ groups['control'] | first }}"

ansible/roles/nhc/tasks/main.yml

Lines changed: 3 additions & 28 deletions
@@ -1,32 +1,7 @@
-- name: Create temporary file for autogenerated NHC configuration
-  ansible.builtin.tempfile:
-  register: nhc_tempfile
 
-- name: Run NHC autoconfiguration
-  ansible.builtin.command:
-    cmd: "nhc-genconf -c {{ nhc_tempfile.path }}"
-  register: _nhc_genconf
-  failed_when: _nhc_genconf.rc > 1 # https://github.com/mej/nhc/issues/158
-
-- name: Apply modifications to autogenerated NHC configuration
-  ansible.builtin.lineinfile: "{{ item | combine(nhc_lineinfile_defaults) }}"
-  loop: "{{ nhc_config_changes }}"
-  vars:
-    nhc_lineinfile_defaults:
-      path: "{{ nhc_tempfile.path }}"
-
-- name: Remove timestamp from autogenerated NHC configuration
-  # for idempotency
-  ansible.builtin.replace:
-    path: "{{ nhc_tempfile.path }}"
-    # note this matches a multiline string:
-    regexp: '# This file was automatically generated by nhc-genconf\n#.*'
-    replace: '# This file was automatically generated by nhc-genconf\n# (timestamp removed for idempotency)'
-
-- name: Copy modified NHC configuration to active location
-  ansible.builtin.copy:
-    remote_src: true
-    src: "{{ nhc_tempfile.path }}"
+- name: Template out NHC configuration
+  ansible.builtin.template:
+    src: "{{ nhc_config_template }}"
     dest: /etc/nhc/nhc.conf
     owner: root
     group: root
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# {{ ansible_managed }}
+
+## Filesystem checks
+{% for mount in ansible_mounts %}
+{% set mount_mode = 'rw' if 'rw' in mount.options.split(',') else 'ro' %}
+{{ ansible_fqdn }} || check_fs_mount_{{ mount_mode }} -t "{{ mount.fstype }}" -s "{{ mount.device }}" -f "{{ mount.mount }}"
+{% endfor %}
+
+## Ethernet interface checks
+{% for iface in ansible_interfaces | select('match', 'eth') %}
+{{ ansible_fqdn }} || check_hw_eth {{ iface }}
+{% endfor %}
+
+## Site-specific checks
+{{ nhc_config_extra }}
