@@ -21,83 +21,46 @@ To enable node health checks, ensure the `nhc` group contains the `compute` group:
compute
```

- This will automatically:
+ When the `ansible/site.yml` playbook is run, this will automatically:
1. Add NHC-related configuration to the `slurm.conf` Slurm configuration file.
-   The default configuration is defined in `openhpc_config_nhc`
-   (see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)).
-   It will run healthchecks on all `IDLE` nodes which are not `DRAINED` or `NOT_RESPONDING`
-   every 300 seconds. See [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html)
-   `HealthCheckInterval`, `HealthCheckNodeState`, `HealthCheckProgram`. These may
-   be overriden if required by redefining `openhpc_config_nhc` in e.g.
-   `environments/site/inventory/group_vars/nhc/yml`.
-
- 2. Define a default configuration for health checks for each compute node
-   individually using [nhc-genconf](https://github.com/mej/nhc?tab=readme-ov-file#config-file-auto-generation)
-   The generated checks include:
+   The default configuration is defined in `openhpc_config_nhc`
+   (see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)).
+   It runs health checks on all `IDLE` nodes which are not `DRAINED` or
+   `NOT_RESPONDING` every 300 seconds. See the [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html)
+   `HealthCheckInterval`, `HealthCheckNodeState` and `HealthCheckProgram`. These
+   may be overridden if required by redefining `openhpc_config_nhc` in e.g.
+   `environments/site/inventory/group_vars/nhc/yml`.
+
+ 2. Template out node health check rules using Ansible facts for each compute
+   node. Currently these check:
   - Filesystem mounts
-   - Filesystem space
-   - CPU info
-   - Memory and swap
-   - Network interfaces
-   - Various processes
+   - Ethernet interfaces

See `/etc/nhc/nhc.conf` on a compute node for the full configuration.
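+
+ As an illustrative sketch only (the actual defaults and available keys are in
+ the `openhpc.yml` file linked in step 1, and the parameter values shown here
+ are hypothetical), a redefinition of `openhpc_config_nhc` in
+ `environments/site/inventory/group_vars/nhc/yml` might look like:
+
+ ```yaml
+ # Illustrative override: run health checks every 10 minutes on idle nodes.
+ # /usr/sbin/nhc is the wrapper script installed by the NHC package.
+ openhpc_config_nhc:
+   HealthCheckProgram: /usr/sbin/nhc
+   HealthCheckInterval: 600
+   HealthCheckNodeState: IDLE
+ ```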

-   The automatically generated checks may be modified or disabled using the
-   `nhc_replacements` role variable described below.
-
If a node healthcheck run fails, Slurm will mark the node `DOWN`. With the
default [alerting configuration](../../../docs/alerting.md) this will trigger
an alert.

- ## Updating Health Checks
-
- The above approach assumes that when the `site.yml` playbook is run all nodes
- are functioning correctly. Therefore if changes are made to aspects covered by
- the healthchecks (see above) without re-running this playbook, use the following
- to update the autogenerated health checks:
-
- ```shell
- ansible-playbook ansible/extras.yml --tags nhc
- ```
-
## Role Variables

- - `nhc_config_changes`: Optional, default empty list. A list of mappings
-   defining replacements in the autogenerated health checks. Mappings may have
-   any keys from [ansible.builtin.lineinfile](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html).
-   Note that the NHC [configuration line format](https://github.com/mej/nhc?tab=readme-ov-file#configuration-file-syntax) is:
-
-       TARGET || CHECK
-
-   where for autogenerated checks `TARGET` is the hostname. So a regex like:
-
-       '^(\s+\S+\s+)\|\|(\s+.*)$'
-
-   captures the `TARGET` and `||` separator as `\1` and the actual check as `\2`.
-   For example the following configuration will comment-out checks for a
-   specific interface for all nodes - in reality using `state: absent` would be
-   better if the check is not required, but this shows the regex syntax:
-
-   ```yaml
-   nhc_config_changes:
-     - regexp: '^(\s+\S+\s+\|\|\s+)(check_hw_eth eth0)$'
-       line: '#\1\2'
-       backrefs: yes
-   ```
+ - `nhc_config_template`: Template to use. Default is the in-role template
+   providing the rules described above.
+ - `nhc_config_extra`: Possibly multiline string defining [additional rules](https://github.com/mej/nhc/blob/master/README.md) to
+   add. Jinja templating may be used. Default is an empty string.
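+
+   For example (an illustrative sketch only; `check_fs_used` is one of the
+   standard checks documented in the NHC README linked above, and `*` targets
+   all hosts):
+
+   ```yaml
+   nhc_config_extra: |
+     # Fail the node if /tmp is more than 90% full.
+     * || check_fs_used /tmp 90%
+   ```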

## Structure

This role contains 3x task files, which run at different times:
- - `main.yml`: Runs from `site.yml` -> `slurm.yml`. Generates health check
-   configuration.
- - `export.yml`: Runs from `site.yml` -> `extras.yml` via role `compute_init`
-   tasks `export.yml`. Copies the generated health check configuration to the
-   control node NFS share for compute-init.
- - `import.yml`: Runs on boot via `compute_init/files/compute-init.yml` and
-   copies the node's generated health check configuration from the control node
-   NFS share to local disk.
+ - `main.yml`: Runs from `site.yml` -> `slurm.yml`. Templates health check
+   configuration to nodes.
+ - `export.yml`: Runs from `site.yml` -> `final.yml` via role `compute_init`
+   tasks `export.yml`. Templates health check configuration to the cluster NFS
+   share for compute-init.
+ - `boot.yml`: Runs on boot via `compute_init/files/compute-init.yml`. Copies
+   the node's generated health check configuration from the cluster share to
+   local disk.

Note that the `stackhpc.openhpc` role:
- Installs the required package
- - Configures slurm
+ - Configures slurm.conf parameters
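+
+ With the defaults described above, the NHC-related entries in the generated
+ `slurm.conf` take roughly the following form (values illustrative; the actual
+ defaults are set by `openhpc_config_nhc`):
+
+ ```
+ HealthCheckProgram=/usr/sbin/nhc
+ HealthCheckInterval=300
+ HealthCheckNodeState=IDLE
+ ```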