| 1 | +# Node Health Checks (nhc) |
| 2 | + |
| 3 | +Deploys and configures the LBNL [Node Health Check](https://github.com/mej/nhc) |
| 4 | +(NHC), which puts nodes into the `DOWN` state if they fail periodic checks of |
| 5 | +aspects such as filesystems, memory and network interfaces. |
| 6 | + |
| 7 | +Due to the integration with Slurm, this is tightly linked to the configuration |
| 8 | +for the [stackhpc.openhpc](../stackhpc.openhpc/README.md) role. |
| 9 | + |
| 10 | +## Enabling |
| 11 | + |
| 12 | +By [default](../../../environments/common/inventory/group_vars/all/openhpc.yml) |
| 13 | +the required `nhc-ohpc` packages are installed in all images. |
| 14 | + |
| 15 | +To enable node health checks, ensure the `nhc` group contains the `compute` group: |
| 16 | + |
| 17 | +```yaml |
| 18 | +# environments/site/inventory/groups: |
| 19 | +[nhc:children] |
| 20 | +# Hosts to configure for node health checks |
| 21 | +compute |
| 22 | +``` |
| 23 | + |
| 24 | +This will: |
| 25 | +1. Add NHC-related configuration to the `slurm.conf` Slurm configuration file. |
| 26 | +The default configuration is defined in `openhpc_config_nhc` |
| 27 | +(see [environments/common/inventory/group_vars/all/openhpc.yml](../../../environments/common/inventory/group_vars/all/openhpc.yml)). |
| 28 | +It will run health checks on all `IDLE` nodes which are not `DRAINED` or `NOT_RESPONDING` |
| 29 | +every 300 seconds. See the [slurm.conf parameters](https://slurm.schedmd.com/slurm.conf.html) |
| 30 | +`HealthCheckInterval`, `HealthCheckNodeState` and `HealthCheckProgram`. These may |
| 31 | +be overridden if required by redefining `openhpc_config_nhc` in e.g. |
| 32 | +`environments/site/inventory/group_vars/nhc/yml`. |
| 33 | + |
| 34 | +2. Define a default configuration for health checks for each compute node |
| 35 | +individually using [nhc-genconf](https://github.com/mej/nhc?tab=readme-ov-file#config-file-auto-generation). |
| 36 | +The generated checks include: |
| 37 | + - Filesystem mounts |
| 38 | + - Filesystem space |
| 39 | + - CPU info |
| 40 | + - Memory and swap |
| 41 | + - Network interfaces |
| 42 | + - Various processes |
| 43 | + |
| 44 | + See `/etc/nhc/nhc.conf` on a compute node for the full configuration. |
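
As an illustration of overriding the Slurm parameters from step 1, a site override might look like the sketch below. It assumes `openhpc_config_nhc` is a mapping of `slurm.conf` parameter names to values. The values shown, including the `/usr/sbin/nhc` path, are illustrative assumptions rather than the role's confirmed defaults, so check the linked `openhpc.yml` before copying them.

```yaml
# Sketch only: assumes openhpc_config_nhc is a mapping of slurm.conf
# parameters; confirm the real defaults in the common openhpc.yml first.
openhpc_config_nhc:
  HealthCheckProgram: /usr/sbin/nhc      # assumed install path of the nhc script
  HealthCheckInterval: 600               # seconds between runs (300 is described above)
  HealthCheckNodeState: NONDRAINED_IDLE  # idle nodes which are not drained
```

Because the variable is redefined as a whole, all required parameters are repeated here even though only the interval differs from the behaviour described above.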
| 45 | + |
| 46 | +The automatically generated checks may be modified or disabled using the |
| 47 | +`nhc_replacements` role variable described below. |
| 48 | + |
| 49 | +If a node healthcheck run fails, Slurm will mark the node `DOWN`. With the |
| 50 | +default [alerting configuration](../../../docs/alerting.md) this will trigger |
| 51 | +an alert. |
| 52 | + |
| 53 | +## Updating Health Checks |
| 54 | + |
| 55 | +The above approach assumes that all nodes are functioning correctly when the |
| 56 | +`site.yml` playbook is run. Therefore, if changes are made to aspects covered by |
| 57 | +the health checks (see above) without re-running `site.yml`, use the following |
| 58 | +to regenerate the autogenerated health checks: |
| 59 | + |
| 60 | + ```shell |
| 61 | + ansible-playbook ansible/extras.yml --tags nhc |
| 62 | + ``` |
| 63 | + |
| 64 | +## Role Variables |
| 65 | + |
| 66 | +- `nhc_replacements`: Optional, default empty list. A list of mappings |
| 67 | + defining replacements in the autogenerated health checks. Items must have |
| 68 | + keys `regexp` and `replace` which are as for [ansible.builtin.replace](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/replace_module.html). |
| 69 | + Note that the NHC [configuration line format](https://github.com/mej/nhc?tab=readme-ov-file#configuration-file-syntax) is: |
| 70 | + |
| 71 | + TARGET || CHECK |
| 72 | + |
| 73 | + where for autogenerated checks `TARGET` is the hostname. So a regex like: |
| 74 | + |
| 75 | + '^(\s+\S+\s+\|\|\s+)(.*)$' |
| 76 | + |
| 77 | + captures the `TARGET` and `||` separator as `\1` and the actual check as `\2`. |
| 78 | + Hence the following item would comment-out checks on a particular interface |
| 79 | + on all nodes: |
| 80 | + |
| 81 | + - regexp: '^(\s+\S+\s+\|\|\s+)(check_hw_eth eth0)$' |
| 82 | +   replace: '#\1\2' |
| 83 | + |
| 84 | + See documentation for `ansible.builtin.replace` for more information. This is |
| 85 | + an example only - for this actual case removing the line entirely with |
| 86 | + `replace: ''` might be better. Using https://regex101.com/ (in Python |
| 87 | + mode) or similar may be useful during development. |
| 88 | + |
| 89 | +- `nhc_replacements_default`: Optional. As above, but by default includes a |
| 90 | + mapping to remove the autogenerated timestamp line from the check configuration |
| 91 | + file for idempotency. |
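
A site definition of `nhc_replacements` could look like the following minimal sketch, which removes a check rather than commenting it out. The `check_hw_eth eth0` check is taken from the example above. The trailing `\n` in the pattern is an assumption about also wanting to drop the line break, and which `group_vars` file holds the variable is left to the site.

```yaml
# Sketch only: place in a group_vars file that applies to the nhc group.
nhc_replacements:
  # Remove the eth0 check entirely on all nodes. The trailing \n also
  # consumes the line break so that no empty line is left behind.
  - regexp: '^\s+\S+\s+\|\|\s+check_hw_eth eth0\n'
    replace: ''
```

Single-quoted YAML strings keep backslashes literal, so the pattern reaches the regex engine unchanged. After changing the variable, the checks would presumably need regenerating as described under "Updating Health Checks".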
| 92 | + |
| 93 | +## Structure |
| 94 | + |
| 95 | +This role contains three task files, which run at different times: |
| 96 | +- `main.yml`: Runs from `site.yml` -> `slurm.yml`. Generates health check |
| 97 | + configuration. |
| 98 | +- `export.yml`: Runs from `site.yml` -> `extras.yml` via role `compute_init` |
| 99 | + tasks `export.yml`. Copies the generated health check configuration to the |
| 100 | + control node NFS share for compute-init. |
| 101 | +- `import.yml`: Runs on boot via `compute_init/files/compute-init.yml` and |
| 102 | + copies the node's generated health check configuration from the control node |
| 103 | + NFS share to local disk. |
| 104 | + |
| 105 | +Note that the `stackhpc.openhpc` role: |
| 106 | +- Installs the required package |
| 107 | +- Configures Slurm |