- # EXPERIMENTAL: compute-init
-
- Experimental / in-progress functionality to allow compute nodes to rejoin the
- cluster after a reboot.
-
- To enable this add compute nodes (or a subset of them into) the `compute_init`
- group.
-
+ # EXPERIMENTAL: compute_init
+
+ Experimental functionality to allow compute nodes to rejoin the cluster after
+ a reboot without running the `ansible/site.yml` playbook.
+
+ To enable this:
+ 1. Add the `compute` group (or a subset) into the `compute_init` group. This is
+    the default when using cookiecutter to create an environment, via the
+    "everything" template.
+ 2. Build an image which includes the `compute_init` group. This is the case
+    for StackHPC-built release images.
+ 3. Enable the required functionalities during boot, by setting the
+    `compute_init_enable` property for a compute group in the
+    OpenTofu `compute` variable to a list which includes "compute", plus the
+    other roles/functionalities required, e.g.:
+
+ ```terraform
+ ...
+ compute = {
+     general = {
+         nodes = ["general-0", "general-1"]
+         compute_init_enable = ["compute", ... ] # see below
+     }
+ }
+ ...
+ ```
+
+ ## Supported appliance functionalities
+
+ The string "compute" must be present in the `compute_init_enable` list to enable
+ this functionality. The table below shows which other appliance functionalities
+ are currently supported; use the name in the role column to enable these.
+
+ | Playbook | Role (or functionality) | Support |
+ | -------- | ----------------------- | ------- |
+ | hooks/pre.yml | ? | None at present |
+ | validate.yml | n/a | Not relevant during boot |
+ | bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
+ | bootstrap.yml | resolv_conf | Fully supported |
+ | bootstrap.yml | etc_hosts | Fully supported |
+ | bootstrap.yml | proxy | None at present |
+ | bootstrap.yml | (/etc permissions) | None required - use image build |
+ | bootstrap.yml | (ssh /home fix) | None required - use image build |
+ | bootstrap.yml | (system users) | None required - use image build |
+ | bootstrap.yml | systemd | None required - use image build |
+ | bootstrap.yml | selinux | None required - use image build |
+ | bootstrap.yml | sshd | None at present |
+ | bootstrap.yml | dnf_repos | None at present (requirement TBD) |
+ | bootstrap.yml | squid | Not relevant for compute nodes |
+ | bootstrap.yml | tuned | None |
+ | bootstrap.yml | freeipa_server | Not relevant for compute nodes |
+ | bootstrap.yml | cockpit | None required - use image build |
+ | bootstrap.yml | firewalld | Not relevant for compute nodes |
+ | bootstrap.yml | fail2ban | Not relevant for compute nodes |
+ | bootstrap.yml | podman | Not relevant for compute nodes |
+ | bootstrap.yml | update | Not relevant during boot |
+ | bootstrap.yml | reboot | Not relevant for compute nodes |
+ | bootstrap.yml | ofed | Not relevant during boot |
+ | bootstrap.yml | ansible_init (install) | Not relevant during boot |
+ | bootstrap.yml | k3s (install) | Not relevant during boot |
+ | hooks/post-bootstrap.yml | ? | None at present |
+ | iam.yml | freeipa_client | None at present [1] |
+ | iam.yml | freeipa_server | Not relevant for compute nodes |
+ | iam.yml | sssd | None at present |
+ | filesystems.yml | block_devices | None required - role deprecated |
+ | filesystems.yml | nfs | All client functionality |
+ | filesystems.yml | manila | All functionality |
+ | filesystems.yml | lustre | None at present |
+ | extras.yml | basic_users | All functionality [2] |
+ | extras.yml | eessi | All functionality [3] |
+ | extras.yml | cuda | None required - use image build [4] |
+ | extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
+ | extras.yml | compute_init (export) | Not relevant for compute nodes |
+ | extras.yml | k9s (install) | Not relevant during boot |
+ | extras.yml | extra_packages | None at present. Would require dnf_repos |
+ | slurm.yml | mysql | Not relevant for compute nodes |
+ | slurm.yml | rebuild | Not relevant for compute nodes |
+ | slurm.yml | openhpc [5] | All slurmd-related functionality |
+ | slurm.yml | (set memory limits) | None at present |
+ | slurm.yml | (block ssh) | None at present |
+ | portal.yml | (openondemand server) | Not relevant for compute nodes |
+ | portal.yml | (openondemand vnc desktop) | None required - use image build |
+ | portal.yml | (openondemand jupyter server) | None required - use image build |
+ | monitoring.yml | (all monitoring) | None at present [6] |
+ | disable-repos.yml | dnf_repos | None at present (requirement TBD) |
+ | hooks/post.yml | ? | None at present |
+
+
+ Notes:
+ 1. FreeIPA client functionality would be better provided using a client fork
+    which uses pkinit keys rather than OTP to re-enrol nodes.
+ 2. Assumes home directory already exists on shared storage.
+ 3. Assumes `cvmfs_config` is the same on the control node and all compute nodes.
+ 4. If the `cuda` role was run during build, the `nvidia-persistenced` service is
+    enabled and will start during boot.
+ 5. `openhpc` does not need to be added to `compute_init_enable`; it is
+    automatically enabled by adding `compute`.
+ 6. Only node-exporter tasks are relevant, and will be done via k3s in a future release.
+
+
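+ As an illustration of the table above, the following is a minimal sketch of a
+ compute group which enables every role currently listed with boot-time
+ support. The group name and node names are simply reused from the earlier
+ example and are assumptions, as is the choice to enable all of the roles at
+ once; which roles a given cluster actually needs depends on the site.
+
+ ```terraform
+ compute = {
+     general = {
+         nodes = ["general-0", "general-1"]
+         # "compute" is mandatory and also enables openhpc (see the openhpc note).
+         # The remaining names are taken from the role column of the table.
+         compute_init_enable = [
+             "compute",
+             "resolv_conf",
+             "etc_hosts",
+             "nfs",
+             "manila",
+             "basic_users",
+             "eessi",
+         ]
+     }
+ }
+ ```
+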
+ ## Approach
This works as follows:
1. During image build, an ansible-init playbook and supporting files
(e.g. templates, filters, etc.) are installed.
@@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
to configure the node before the services on the control node are available
(which requires running the site.yml playbook).

- The following roles/groups are currently fully functional:
- - `resolv_conf`: all functionality
- - `etc_hosts`: all functionality
- - `nfs`: client functionality only
- - `manila`: all functionality
- - `basic_users`: all functionality, assumes home directory already exists on
-   shared storage
- - `eessi`: all functionality, assumes `cvmfs_config` is the same on control
-   node and all compute nodes.
- - `openhpc`: all functionality
-
- The above may be enabled by setting the compute_init_enable property on the
- tofu compute variable.
-
- # Development/debugging
+ ## Development/debugging

To develop/debug changes to the compute script without actually having to build
a new image:
@@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
as in step 3.


- # Design notes
+ ## Design notes
- Duplicating code in roles into the `compute-init` script is unfortunate, but
does allow developing this functionality without wider changes to the
appliance.