- # EXPERIMENTAL: compute-init
-
- Experimental / in-progress functionality to allow compute nodes to rejoin the
- cluster after a reboot.
-
- To enable this add compute nodes (or a subset of them into) the `compute_init`
- group.
-
+ # EXPERIMENTAL: compute_init
+
+ Experimental functionality to allow compute nodes to rejoin the cluster after
+ a reboot without running the `ansible/site.yml` playbook.
+
+ To enable this:
+ 1. Add the `compute` group (or a subset) into the `compute_init` group. This is
+    the default when using cookiecutter to create an environment, via the
+    "everything" template.
+ 2. Build an image which includes the `compute_init` group. This is the case
+    for StackHPC-built release images.
+ 3. Enable the required functionalities during boot, by setting the
+    `compute_init_enable` property for a compute group in the
+    OpenTofu `compute` variable to a list which includes "compute", plus the
+    other roles/functionalities required, e.g.:
+
+     ```terraform
+     ...
+     compute = {
+         general = {
+             nodes = ["general-0", "general-1"]
+             compute_init_enable = ["compute", ... ] # see below
+         }
+     }
+     ...
+     ```
+
+ ## Supported appliance functionalities
+
+ In the table below, if a role is marked as supported then its functionality
+ can be enabled during boot by adding the role name to the `compute_init_enable`
+ property described above. If a role is marked as requiring a custom image then
+ it also requires an image build with the role name added to the
+ [Packer inventory_groups variable](../../../docs/image-build.md).
+
+ | Playbook                 | Role (or functionality)       | Support                         | Custom image reqd.? |
+ |--------------------------|-------------------------------|---------------------------------|---------------------|
+ | hooks/pre.yml            | ?                             | None at present                 | n/a                 |
+ | validate.yml             | n/a                           | Not relevant during boot        | n/a                 |
+ | bootstrap.yml            | (wait for ansible-init)       | Not relevant during boot        | n/a                 |
+ | bootstrap.yml            | resolv_conf                   | Fully supported                 | No                  |
+ | bootstrap.yml            | etc_hosts                     | Fully supported                 | No                  |
+ | bootstrap.yml            | proxy                         | None at present                 | No                  |
+ | bootstrap.yml            | (/etc permissions)            | None required - use image build | No                  |
+ | bootstrap.yml            | (ssh /home fix)               | None required - use image build | No                  |
+ | bootstrap.yml            | (system users)                | None required - use image build | No                  |
+ | bootstrap.yml            | systemd                       | None required - use image build | No                  |
+ | bootstrap.yml            | selinux                       | None required - use image build | Maybe [1]           |
+ | bootstrap.yml            | sshd                          | None at present                 | No                  |
+ | bootstrap.yml            | dnf_repos                     | None at present [2]             | -                   |
+ | bootstrap.yml            | squid                         | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | tuned                         | Fully supported                 | No                  |
+ | bootstrap.yml            | freeipa_server                | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | cockpit                       | None required - use image build | No                  |
+ | bootstrap.yml            | firewalld                     | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | fail2ban                      | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | podman                        | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | update                        | Not relevant during boot        | n/a                 |
+ | bootstrap.yml            | reboot                        | Not relevant for compute nodes  | n/a                 |
+ | bootstrap.yml            | ofed                          | Not relevant during boot        | Yes                 |
+ | bootstrap.yml            | ansible_init (install)        | Not relevant during boot        | n/a                 |
+ | bootstrap.yml            | k3s (install)                 | Not relevant during boot        | n/a                 |
+ | hooks/post-bootstrap.yml | ?                             | None at present                 | n/a                 |
+ | iam.yml                  | freeipa_client                | None at present [3]             | Yes                 |
+ | iam.yml                  | freeipa_server                | Not relevant for compute nodes  | n/a                 |
+ | iam.yml                  | sssd                          | None at present                 | No                  |
+ | filesystems.yml          | block_devices                 | None required - role deprecated | n/a                 |
+ | filesystems.yml          | nfs                           | All client functionality        | No                  |
+ | filesystems.yml          | manila                        | All functionality               | No [4]              |
+ | filesystems.yml          | lustre                        | None at present                 | Yes                 |
+ | extras.yml               | basic_users                   | All functionality [5]           | No                  |
+ | extras.yml               | eessi                         | All functionality [6]           | No                  |
+ | extras.yml               | cuda                          | None required - use image build | Yes [7]             |
+ | extras.yml               | persist_hostkeys              | Not relevant for compute nodes  | n/a                 |
+ | extras.yml               | compute_init (export)         | Not relevant for compute nodes  | n/a                 |
+ | extras.yml               | k9s (install)                 | Not relevant during boot        | n/a                 |
+ | extras.yml               | extra_packages                | None at present [8]             | -                   |
+ | slurm.yml                | mysql                         | Not relevant for compute nodes  | n/a                 |
+ | slurm.yml                | rebuild                       | Not relevant for compute nodes  | n/a                 |
+ | slurm.yml                | openhpc [9]                   | All slurmd functionality        | No                  |
+ | slurm.yml                | (set memory limits)           | None at present                 | -                   |
+ | slurm.yml                | (block ssh)                   | None at present                 | -                   |
+ | portal.yml               | (openondemand server)         | Not relevant for compute nodes  | n/a                 |
+ | portal.yml               | (openondemand vnc desktop)    | None required - use image build | No                  |
+ | portal.yml               | (openondemand jupyter server) | None required - use image build | No                  |
+ | monitoring.yml           | node_exporter                 | None required - use image build | No                  |
+ | monitoring.yml           | (other monitoring)            | Not relevant for compute nodes  | -                   |
+ | disable-repos.yml        | dnf_repos                     | None at present [2]             | -                   |
+ | hooks/post.yml           | ?                             | None at present                 | -                   |
+
+
+ Notes:
+ 1. `selinux` is set to disabled in StackHPC images.
+ 2. Requirement for this functionality is TBD.
+ 3. FreeIPA client functionality would be better provided using a client fork
+    which uses pkinit keys rather than OTP to re-enrol nodes.
+ 4. Assuming default Ceph client version.
+ 5. Assumes home directory already exists on shared storage.
+ 6. Assumes `cvmfs_config` is the same on control node and all compute nodes.
+ 7. If the `cuda` role was run during build, the `nvidia-persistenced` service is
+    enabled and will start during boot.
+ 8. Would require `dnf_repos`.
+ 9. `openhpc` does not need to be added to `compute_init_enable`; it is
+    automatically enabled by adding `compute`.
+
+ ## Approach
  This works as follows:
  1. During image build, an ansible-init playbook and supporting files
     (e.g. templates, filters, etc.) are installed.
@@ -31,21 +129,7 @@ The check in 4b. above is what prevents the compute-init script from trying
  to configure the node before the services on the control node are available
  (which requires running the site.yml playbook).

- The following roles/groups are currently fully functional:
- - `resolv_conf`: all functionality
- - `etc_hosts`: all functionality
- - `nfs`: client functionality only
- - `manila`: all functionality
- - `basic_users`: all functionality, assumes home directory already exists on
-   shared storage
- - `eessi`: all functionality, assumes `cvmfs_config` is the same on control
-   node and all compute nodes.
- - `openhpc`: all functionality
-
- The above may be enabled by setting the compute_init_enable property on the
- tofu compute variable.
-
- # Development/debugging
+ ## Development/debugging

  To develop/debug changes to the compute script without actually having to build
  a new image:
@@ -83,7 +167,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
  as in step 3.


- # Design notes
+ ## Design notes
  - Duplicating code in roles into the `compute-init` script is unfortunate, but
    does allow developing this functionality without wider changes to the
    appliance.
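
The "Supported appliance functionalities" table added in this diff implies a simple rule for a `compute_init_enable` list: it must include "compute", and every other entry should be a role the table marks as supported during boot. The following is a minimal sketch of that rule in Python, not part of the appliance. The function name `check_compute_init_enable` and the `SUPPORTED_ROLES` set are hypothetical, and the set is only my reading of the table rows marked "Fully supported" or "All ... functionality".

```python
# Hypothetical helper, not part of the appliance: checks a
# compute_init_enable list against the support table in this README.

# Assumption: role names read from table rows marked "Fully supported"
# or "All ... functionality". "compute" is required and, per note 9,
# also enables openhpc (slurmd).
SUPPORTED_ROLES = frozenset({
    "compute",
    "resolv_conf",
    "etc_hosts",
    "tuned",
    "nfs",
    "manila",
    "basic_users",
    "eessi",
})


def check_compute_init_enable(enabled):
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    if "compute" not in enabled:
        problems.append('list must include "compute"')
    for role in enabled:
        if role == "openhpc":
            # Note 9: openhpc is enabled automatically by "compute".
            problems.append('"openhpc" is redundant: it is enabled by "compute"')
        elif role not in SUPPORTED_ROLES:
            problems.append(f'"{role}" is not supported during boot')
    return problems


if __name__ == "__main__":
    print(check_compute_init_enable(["compute", "etc_hosts", "nfs"]))
    print(check_compute_init_enable(["etc_hosts", "lustre"]))
```

A check like this could run before `tofu apply` to catch a role such as `lustre`, which the table lists as "None at present", before a node boots without it. It does not cover the "Custom image reqd.?" column, which depends on how the image was built.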