1- # EXPERIMENTAL: compute-init
2-
3- Experimental / in-progress functionality to allow compute nodes to rejoin the
4- cluster after a reboot.
5-
6- To enable this add compute nodes (or a subset of them into) the ` compute_init `
7- group.
8-
1+ # EXPERIMENTAL: compute_init
2+
3+ Experimental functionality to allow compute nodes to rejoin the cluster after
4+ a reboot without running the `ansible/site.yml` playbook.
5+
6+ To enable this:
7+ 1. Add the `compute` group (or a subset of it) to the `compute_init` group. This is
8+    the default when using cookiecutter to create an environment, via the
9+    "everything" template.
10+ 2. Build an image which includes the `compute_init` group. This is the case
11+    for StackHPC-built release images.
12+ 3. Enable the required functionalities during boot by setting the
13+    `compute_init_enable` property for a compute group in the
14+    OpenTofu `compute` variable to a list which includes "compute", plus the
15+    other roles/functionalities required, e.g.:
16+
17+ ```terraform
18+ ...
19+ compute = {
20+     general = {
21+         nodes = ["general-0", "general-1"]
22+         compute_init_enable = ["compute", ... ] # see below
23+     }
24+ }
25+ ...
26+ ```
27+
28+ ## Supported appliance functionalities
29+
30+ The string "compute" must be present in the `compute_init_enable` list to enable
31+ this functionality. The table below shows which other appliance functionalities
32+ are currently supported; use the name in the role column to enable them.
33+
34+ | Playbook | Role (or functionality) | Support |
35+ | -------------------------| -------------------------| -----------------|
36+ | hooks/pre.yml | ? | None at present |
37+ | validate.yml | n/a | Not relevant during boot |
38+ | bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
39+ | bootstrap.yml | resolv_conf | Fully supported |
40+ | bootstrap.yml | etc_hosts | Fully supported |
41+ | bootstrap.yml | proxy | None at present |
42+ | bootstrap.yml | (/etc permissions) | None required - use image build |
43+ | bootstrap.yml | (ssh /home fix) | None required - use image build |
44+ | bootstrap.yml | (system users) | None required - use image build |
45+ | bootstrap.yml | systemd | None required - use image build |
46+ | bootstrap.yml | selinux | None required - use image build |
47+ | bootstrap.yml | sshd | None at present |
48+ | bootstrap.yml | dnf_repos | None at present (requirement TBD) |
49+ | bootstrap.yml | squid | Not relevant for compute nodes |
50+ | bootstrap.yml | tuned | None |
51+ | bootstrap.yml | freeipa_server | Not relevant for compute nodes |
52+ | bootstrap.yml | cockpit | None required - use image build |
53+ | bootstrap.yml | firewalld | Not relevant for compute nodes |
54+ | bootstrap.yml | fail2ban | Not relevant for compute nodes |
55+ | bootstrap.yml | podman | Not relevant for compute nodes |
56+ | bootstrap.yml | update | Not relevant during boot |
57+ | bootstrap.yml | reboot | Not relevant for compute nodes |
58+ | bootstrap.yml | ofed | Not relevant during boot |
59+ | bootstrap.yml | ansible_init (install) | Not relevant during boot |
60+ | bootstrap.yml | k3s (install) | Not relevant during boot |
61+ | hooks/post-bootstrap.yml | ? | None at present |
62+ | iam.yml | freeipa_client | None at present [1] |
63+ | iam.yml | freeipa_server | Not relevant for compute nodes |
64+ | iam.yml | sssd | None at present |
65+ | filesystems.yml | block_devices | None required - role deprecated |
66+ | filesystems.yml | nfs | All client functionality |
67+ | filesystems.yml | manila | All functionality |
68+ | filesystems.yml | lustre | None at present |
69+ | extras.yml | basic_users | All functionality [2] |
70+ | extras.yml | eessi | All functionality [3] |
71+ | extras.yml | cuda | None required - use image build [4] |
72+ | extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
73+ | extras.yml | compute_init (export) | Not relevant for compute nodes |
74+ | extras.yml | k9s (install) | Not relevant during boot |
75+ | extras.yml | extra_packages | None at present. Would require dnf_repos |
76+ | slurm.yml | mysql | Not relevant for compute nodes |
77+ | slurm.yml | rebuild | Not relevant for compute nodes |
78+ | slurm.yml | openhpc [5] | All slurmd-related functionality |
79+ | slurm.yml | (set memory limits) | None at present |
80+ | slurm.yml | (block ssh) | None at present |
81+ | portal.yml | (openondemand server) | Not relevant for compute nodes |
82+ | portal.yml | (openondemand vnc desktop) | None required - use image build |
83+ | portal.yml | (openondemand jupyter server) | None required - use image build |
84+ | monitoring.yml | (all monitoring) | None at present [6] |
85+ | disable-repos.yml | dnf_repos | None at present (requirement TBD) |
86+ | hooks/post.yml | ? | None at present |
87+
88+
89+ Notes:
90+ 1. FreeIPA client functionality would be better provided using a client fork
91+    which uses pkinit keys rather than OTP to re-enrol nodes.
92+ 2. Assumes home directory already exists on shared storage.
93+ 3. Assumes `cvmfs_config` is the same on control node and all compute nodes.
94+ 4. If the `cuda` role was run during build, the `nvidia-persistenced` service is
95+    enabled and will start during boot.
96+ 5. `openhpc` does not need to be added to `compute_init_enable`; it is
97+    enabled automatically by adding `compute`.
98+ 6. Only node-exporter tasks are relevant, and will be done via k3s in a future release.
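
As an illustrative sketch only (not part of the change itself), a compute group enabling several of the roles marked as supported in the table above might look like the following. The group name `general` and the node names are carried over from the earlier example; the particular selection of roles is an assumption made for illustration and should be adjusted to what the environment actually uses.

```terraform
compute = {
    general = {
        nodes = ["general-0", "general-1"]
        # "compute" is mandatory; the other entries are role names from the
        # "Role (or functionality)" column (an illustrative selection only).
        # "openhpc" is not listed because it is enabled by "compute" (note 5).
        compute_init_enable = ["compute", "resolv_conf", "etc_hosts", "nfs", "basic_users", "eessi"]
    }
}
```

Roles shown in the table as "None at present" are not expected to be configured during boot even if they are listed here.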
99+
100+
101+ ## Approach
9 102 This works as follows:
10 103 1. During image build, an ansible-init playbook and supporting files
11 104 (e.g. templates, filters, etc) are installed.
@@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
31 124 to configure the node before the services on the control node are available
32 125 (which requires running the site.yml playbook).
33 126
34- The following roles/groups are currently fully functional:
35- - ` resolv_conf ` : all functionality
36- - ` etc_hosts ` : all functionality
37- - ` nfs ` : client functionality only
38- - ` manila ` : all functionality
39- - ` basic_users ` : all functionality, assumes home directory already exists on
40- shared storage
41- - ` eessi ` : all functionality, assumes ` cvmfs_config ` is the same on control
42- node and all compute nodes.
43- - ` openhpc ` : all functionality
44-
45- The above may be enabled by setting the compute_init_enable property on the
46- tofu compute variable.
47-
48- # Development/debugging
127+ ## Development/debugging
49 128
50 129 To develop/debug changes to the compute script without actually having to build
51 130 a new image:
@@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
83 162 as in step 3.
84 163
85 164
86- # Design notes
165+ ## Design notes
87 166 - Duplicating code in roles into the `compute-init` script is unfortunate, but
88 167   does allow developing this functionality without wider changes to the
89 168   appliance.