@@ -21,83 +21,46 @@ To enable node health checks, ensure the `nhc` group contains the `compute` grou
2121compute
2222```
2323
24- This will automatically:
24+ When the `ansible/site.yml` playbook is run, this will automatically:
25251 . Add NHC-related configuration to the ` slurm.conf ` Slurm configuration file.
26- The default configuration is defined in ` openhpc_config_nhc `
27- (see [ environments/common/inventory/group_vars/all/openhpc.yml] ( ../../../environments/common/inventory/group_vars/all/openhpc.yml ) ).
28- It will run healthchecks on all ` IDLE ` nodes which are not ` DRAINED ` or ` NOT_RESPONDING `
29- every 300 seconds. See [ slurm.conf parameters] ( https://slurm.schedmd.com/slurm.conf.html )
30- ` HealthCheckInterval ` , ` HealthCheckNodeState ` , ` HealthCheckProgram ` . These may
31- be overriden if required by redefining ` openhpc_config_nhc ` in e.g.
32- ` environments/site/inventory/group_vars/nhc/yml ` .
33-
34- 2 . Define a default configuration for health checks for each compute node
35- individually using [ nhc-genconf] ( https://github.com/mej/nhc?tab=readme-ov-file#config-file-auto-generation )
36- The generated checks include:
26+ The default configuration is defined in `openhpc_config_nhc`
27+ (see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)).
28+ It will run healthchecks on all `IDLE` nodes which are not `DRAINED` or
29+ `NOT_RESPONDING` every 300 seconds. See [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html)
30+ `HealthCheckInterval`, `HealthCheckNodeState`, `HealthCheckProgram`. These
31+ may be overridden if required by redefining `openhpc_config_nhc` in e.g.
32+ `environments/site/inventory/group_vars/nhc/yml`.
33+
34+ 2 . Template out node health check rules using Ansible facts for each compute
35+ node. Currently these check:
3736 - Filesystem mounts
38- - Filesystem space
39- - CPU info
40- - Memory and swap
41- - Network interfaces
42- - Various processes
37+ - Ethernet interfaces
4338
4439 See ` /etc/nhc/nhc.conf ` on a compute node for the full configuration.
4540
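As a rough illustration of the override described in step 1, the sketch below redefines `openhpc_config_nhc` to run health checks less often. It assumes the variable is a mapping of `slurm.conf` parameter names to values; that shape, the file name and the values shown are assumptions, so check the defaults in `openhpc.yml` before copying it.

```yaml
# Hypothetical file: environments/site/inventory/group_vars/nhc/overrides.yml
# ASSUMPTION: openhpc_config_nhc is a mapping of slurm.conf parameters.
# Redefining it replaces the default, so all three parameters are restated.
openhpc_config_nhc:
  HealthCheckInterval: 600                # seconds between runs (illustrative value)
  HealthCheckNodeState: NONDRAINED_IDLE   # Slurm state selector: idle and not drained
  HealthCheckProgram: /usr/sbin/nhc       # assumed install path of the nhc executable
```

After changing this, the playbook would need to be re-run so that `slurm.conf` is regenerated.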
46- The automatically generated checks may be modified or disabled using the
47- ` nhc_replacements ` role variable described below.
48-
4941If a node healthcheck run fails, Slurm will mark the node ` DOWN ` . With the
5042default [ alerting configuration] ( ../../../docs/alerting.md ) this will trigger
5143an alert.
5244
53- ## Updating Health Checks
54-
55- The above approach assumes that when the ` site.yml ` playbook is run all nodes
56- are functioning correctly. Therefore if changes are made to aspects covered by
57- the healthchecks (see above) without re-running this playbook, use the following
58- to update the autogenerated health checks:
59-
60- ``` shell
61- ansible-playbook ansible/extras.yml --tags nhc
62- ```
63-
6445## Role Variables
6546
66- - ` nhc_config_changes ` : Optional, default empty list. A list of mappings
67- defining replacements in the autogenerated health checks. Mappings may have
68- any keys from [ ansible.builtin.lineinfile] ( https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html ) .
69- Note that the NHC [ configuration line format] ( https://github.com/mej/nhc?tab=readme-ov-file#configuration-file-syntax ) is:
70-
71- TARGET || CHECK
72-
73- where for autogenerated checks ` TARGET ` is the hostname. So a regex like:
74-
75- '^(\s+\S+\s+)\|\|(\s+.*)$'
76-
77- captures the ` TARGET ` and ` || ` separator as ` \1 ` and the actual check as ` \2 ` .
78- For example the following configuration will comment-out checks for a
79- specific interface for all nodes - in reality using ` state: absent ` would be
80- better if the check is not required, but this shows the regex syntax:
81-
82- ```yaml
83- nhc_config_changes:
84-   - regexp: '^(\s+\S+\s+\|\|\s+)(check_hw_eth eth0)$'
85-     line: '#\1\2'
86-     backrefs: yes
87- ```
47+ - `nhc_config_template`: Template to use. Default is the in-role template
48+ providing the rules described above.
49+ - `nhc_config_extra`: Possibly multiline string defining [additional rules](https://github.com/mej/nhc/blob/master/README.md) to
50+ add. Jinja templating may be used. Default is an empty string.
8851
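A minimal sketch of how `nhc_config_extra` might be set, assuming the string is appended to the templated `/etc/nhc/nhc.conf` and rendered per host. The file name, the mount point and the interface name are hypothetical, and the check names should be confirmed against the NHC README linked above.

```yaml
# Hypothetical file: environments/site/inventory/group_vars/nhc/overrides.yml
# NHC rules use the form "TARGET || CHECK"; "*" targets every node.
nhc_config_extra: |
  # Illustrative rule: require /home to be mounted read-write on all nodes.
  * || check_fs_mount_rw -f /home
  # Illustrative Jinja use: target only this host, via an Ansible fact.
  {{ ansible_facts.hostname }} || check_hw_eth eth0
```

A literal block scalar (`|`) keeps one rule per line, which is what the NHC configuration format expects.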
8952## Structure
9053
9154This role contains 3x task files, which run at different times:
92- - `main.yml`: Runs from `site.yml` -> `slurm.yml`. Generates health check
93- configuration.
94- - `export.yml`: Runs from `site.yml` -> `extras.yml` via role `compute_init`
95- tasks `export.yml`. Copies the generated health check configuration to the
96- control node NFS share for compute-init.
97- - `import.yml`: Runs on boot via `compute_init/files/compute-init.yml` and
98- copies the node's generated health check configuration from the control node
99- NFS share to local disk.
55+ - `main.yml`: Runs from `site.yml` -> `slurm.yml`. Templates health check
56+ configuration to nodes.
57+ - `export.yml`: Runs from `site.yml` -> `final.yml` via role `compute_init`
58+ tasks `export.yml`. Templates health check configuration to the cluster NFS
59+ share for compute-init.
60+ - `boot.yml`: Runs on boot via `compute_init/files/compute-init.yml`. Copies
61+ the node's generated health check configuration from the cluster share to
62+ local disk.
10063
10164Note that the ` stackhpc.openhpc ` role:
10265- Installs the required package
103- - Configures slurm
66+ - Configures slurm.conf parameters