Commit 050af5e

review all roles for compute_init_enable
1 parent 00384ff commit 050af5e

1 file changed

+103
-24
lines changed

ansible/roles/compute_init/README.md

Lines changed: 103 additions & 24 deletions
@@ -1,11 +1,104 @@
1-
# EXPERIMENTAL: compute-init
2-
3-
Experimental / in-progress functionality to allow compute nodes to rejoin the
4-
cluster after a reboot.
5-
6-
To enable this add compute nodes (or a subset of them into) the `compute_init`
7-
group.
8-
1+
# EXPERIMENTAL: compute_init
2+
3+
Experimental functionality to allow compute nodes to rejoin the cluster after
4+
a reboot without running the `ansible/site.yml` playbook.
5+
6+
To enable this:
7+
1. Add the `compute` group (or a subset of it) to the `compute_init` group. This is
8+
the default when using cookiecutter to create an environment, via the
9+
"everything" template.
10+
2. Build an image which includes the `compute_init` group. This is the case
11+
for StackHPC-built release images.
12+
3. Enable the required functionalities during boot, by setting the
13+
`compute_init_enable` property for a compute group in the
14+
OpenTofu `compute` variable to a list which includes "compute", plus the
15+
other roles/functionalities required, e.g.:
16+
17+
```terraform
18+
...
19+
compute = {
20+
general = {
21+
nodes = ["general-0", "general-1"]
22+
compute_init_enable = ["compute", ... ] # see below
23+
}
24+
}
25+
...
26+
```
27+
28+
## Supported appliance functionalities
29+
30+
The string "compute" must be present in the `compute_init_enable` list to enable
31+
this functionality. The table below shows which other appliance functionalities
32+
are currently supported - use the name in the role column to enable these.
33+
34+
| Playbook | Role (or functionality) | Support |
35+
| -------------------------|-------------------------|-----------------|
36+
| hooks/pre.yml | ? | None at present |
37+
| validate.yml | n/a | Not relevant during boot |
38+
| bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
39+
| bootstrap.yml | resolv_conf | Fully supported |
40+
| bootstrap.yml | etc_hosts | Fully supported |
41+
| bootstrap.yml | proxy | None at present |
42+
| bootstrap.yml | (/etc permissions) | None required - use image build |
43+
| bootstrap.yml | (ssh /home fix) | None required - use image build |
44+
| bootstrap.yml | (system users) | None required - use image build |
45+
| bootstrap.yml | systemd | None required - use image build |
46+
| bootstrap.yml | selinux | None required - use image build |
47+
| bootstrap.yml | sshd | None at present |
48+
| bootstrap.yml | dnf_repos | None at present (requirement TBD) |
49+
| bootstrap.yml | squid | Not relevant for compute nodes |
50+
| bootstrap.yml | tuned | None |
51+
| bootstrap.yml | freeipa_server | Not relevant for compute nodes |
52+
| bootstrap.yml | cockpit | None required - use image build |
53+
| bootstrap.yml | firewalld | Not relevant for compute nodes |
54+
| bootstrap.yml | fail2ban | Not relevant for compute nodes |
55+
| bootstrap.yml | podman | Not relevant for compute nodes |
56+
| bootstrap.yml | update | Not relevant during boot |
57+
| bootstrap.yml | reboot | Not relevant for compute nodes |
58+
| bootstrap.yml | ofed | Not relevant during boot |
59+
| bootstrap.yml | ansible_init (install) | Not relevant during boot |
60+
| bootstrap.yml | k3s (install) | Not relevant during boot |
61+
| hooks/post-bootstrap.yml | ? | None at present |
62+
| iam.yml | freeipa_client | None at present [1] |
63+
| iam.yml | freeipa_server | Not relevant for compute nodes |
64+
| iam.yml | sssd | None at present |
65+
| filesystems.yml | block_devices | None required - role deprecated |
66+
| filesystems.yml | nfs | All client functionality |
67+
| filesystems.yml | manila | All functionality |
68+
| filesystems.yml | lustre | None at present |
69+
| extras.yml | basic_users | All functionality [2] |
70+
| extras.yml | eessi | All functionality [3] |
71+
| extras.yml | cuda | None required - use image build [4] |
72+
| extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
73+
| extras.yml | compute_init (export) | Not relevant for compute nodes |
74+
| extras.yml | k9s (install) | Not relevant during boot |
75+
| extras.yml | extra_packages | None at present. Would require dnf_repos |
76+
| slurm.yml | mysql | Not relevant for compute nodes |
77+
| slurm.yml | rebuild | Not relevant for compute nodes |
78+
| slurm.yml | openhpc [5] | All slurmd-related functionality |
79+
| slurm.yml | (set memory limits) | None at present |
80+
| slurm.yml | (block ssh) | None at present |
81+
| portal.yml | (openondemand server) | Not relevant for compute nodes |
82+
| portal.yml | (openondemand vnc desktop) | None required - use image build |
83+
| portal.yml | (openondemand jupyter server) | None required - use image build |
84+
| monitoring.yml | (all monitoring) | None at present [6] |
85+
| disable-repos.yml | dnf_repos | None at present (requirement TBD) |
86+
| hooks/post.yml | ? | None at present |
87+
88+
89+
Notes:
90+
1. FreeIPA client functionality would be better provided using a client fork
91+
which uses pkinit keys rather than OTP to reenrol nodes.
92+
2. Assumes home directory already exists on shared storage.
93+
3. Assumes `cvmfs_config` is the same on control node and all compute nodes.
94+
4. If the `cuda` role was run during build, the `nvidia-persistenced` service is enabled
95+
and will start during boot.
96+
5. `openhpc` does not need to be added to `compute_init_enable`; it is
97+
automatically enabled by adding `compute`.
98+
6. Only node-exporter tasks are relevant, and will be done via k3s in a future release.
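
As an illustration of step 3, the sketch below enables every role that the table marks as supported ("Fully supported", "All functionality" or "All client functionality"). The `general` group and its node names are reused from the earlier snippet and are placeholders. Which roles a given environment actually needs is an assumption here, so trim the list to match the groups in use.

```terraform
compute = {
    general = {
        nodes = ["general-0", "general-1"]
        # "compute" is mandatory; per the notes it also enables openhpc,
        # so "openhpc" is not listed separately.
        # The remaining entries are role names taken from the table above.
        # Include only those your environment actually uses.
        compute_init_enable = [
            "compute",
            "resolv_conf",
            "etc_hosts",
            "nfs",
            "manila",
            "basic_users",
            "eessi",
        ]
    }
}
```

Roles marked "None at present" in the table (for example `proxy` or `sssd`) are deliberately left out, since the table does not list them as supported.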
99+
100+
101+
## Approach
9 102
This works as follows:
10 103
1. During image build, an ansible-init playbook and supporting files
11 104
(e.g. templates, filters, etc) are installed.
@@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
31 124
to configure the node before the services on the control node are available
32 125
(which requires running the site.yml playbook).
33 126

34-
The following roles/groups are currently fully functional:
35-
- `resolv_conf`: all functionality
36-
- `etc_hosts`: all functionality
37-
- `nfs`: client functionality only
38-
- `manila`: all functionality
39-
- `basic_users`: all functionality, assumes home directory already exists on
40-
shared storage
41-
- `eessi`: all functionality, assumes `cvmfs_config` is the same on control
42-
node and all compute nodes.
43-
- `openhpc`: all functionality
44-
45-
The above may be enabled by setting the compute_init_enable property on the
46-
tofu compute variable.
47-
48-
# Development/debugging
127+
## Development/debugging
49 128

50 129
To develop/debug changes to the compute script without actually having to build
51 130
a new image:
@@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
83 162
as in step 3.
84 163

85 164

86-
# Design notes
165+
## Design notes
87 166
- Duplicating code in roles into the `compute-init` script is unfortunate, but
88 167
does allow developing this functionality without wider changes to the
89 168
appliance.
