Commit bc7540f

Merge branch 'main' into ci/test-compute-init
2 parents e760db7 + 92a73b7 commit bc7540f

3 files changed

+135
-2
lines changed

ansible/roles/sshd/README.md

Lines changed: 1 addition & 1 deletion
@@ -7,4 +7,4 @@ Configure sshd.
77
- `sshd_password_authentication`: Optional bool. Whether to enable password login. Default `false`.
88
- `sshd_disable_forwarding`: Optional bool. Whether to disable all forwarding features (X11, ssh-agent, TCP and StreamLocal). Default `true`.
99
- `sshd_conf_src`: Optional string. Path to sshd configuration template. Default is in-role template.
10-
- `sshd_conf_dest`: Optional string. Path to destination for sshd configuration file. Default is `/etc/ssh/sshd_config.d/10-ansible.conf` which overides `50-{cloud-init,redhat}` files, if present.
10+
- `sshd_conf_dest`: Optional string. Path to destination for sshd configuration file. Default is `/etc/ssh/sshd_config.d/10-ansible.conf` which overrides `50-{cloud-init,redhat}` files, if present.

docs/production.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ production-ready deployments.
1313
- `staging`: staging environment
1414

1515
A `dev` environment should also be created if considered required, or this
16-
can be left until later.,
16+
can be left until later.
1717

1818
These can all be produced using the cookiecutter instructions, but the
1919
`production` and `staging` environments will need their

docs/sequence.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
1+
# Slurm Appliance Sequences
2+
3+
4+
5+
## Image build
6+
7+
This sequence applies to both:
8+
- "fatimage" builds, starting from GenericCloud images and using
9+
control,login,compute inventory groups to install all packages, e.g. StackHPC
10+
CI builds
11+
- "extra" builds, starting from StackHPC images and using selected inventory
12+
groups to add specific features for a site-specific image.
13+
14+
Note that a generic Pulp server is shown in the below diagram. This may be
15+
StackHPC's Ark server or a local Pulp mirroring Ark. It is assumed a local Pulp
16+
has already had the relevant snapshots synced from Ark (although it is possible
17+
to trigger this during an image build).
18+
19+
Note that ansible-init does not run during an image build. It is disabled via
20+
a metadata flag.
21+
22+
```mermaid
23+
sequenceDiagram
24+
participant ansible as Ansible Deploy Host
25+
participant cloud as Cloud
26+
note over ansible: $ packer build ...
27+
ansible->>cloud: Create VM
28+
create participant packer as Build VM
29+
participant pulp as Pulp
30+
cloud->>packer: Create VM
31+
note over packer: Boot
32+
rect rgb(204, 232, 252)
33+
note right of packer: ansible-init
34+
packer->>cloud: Query metadata
35+
cloud->>packer: Metadata sent
36+
packer->>packer: Skip ansible-init
37+
end
38+
ansible->>packer: Wait for ssh connection
39+
rect rgb(204, 232, 252)
40+
note right of ansible: fatimage.yml
41+
ansible->>packer: Overwrite repo files with Pulp repos and update
42+
packer->>pulp: dnf update
43+
pulp-->>packer: Package updates
44+
ansible->>packer: Perform installation tasks
45+
ansible->>packer: Shutdown
46+
end
47+
ansible->>cloud: Create image from Build VM root disk
48+
destroy packer
49+
note over cloud: Image created
50+
```
51+
52+
## Cluster Creation
53+
54+
In the below it is assumed that no additional packages are installed beyond
55+
what is present in the image, i.e. Ark/local Pulp access is not required.
56+
57+
```mermaid
58+
sequenceDiagram
59+
participant ansible as Ansible Deploy Host
60+
participant cloud as Cloud
61+
rect rgb(204, 232, 252)
62+
note over ansible: $ ansible-playbook ansible/adhoc/generate-passwords.yml
63+
ansible->>ansible: Template secrets to inventory group_vars
64+
end
65+
rect rgb(204, 232, 252)
66+
note over ansible: $ tofu apply ...
67+
ansible->>cloud: Create infra
68+
create participant nodes as Cluster Instances
69+
cloud->>nodes: Create instances
70+
end
71+
note over nodes: Boot
72+
rect rgb(204, 232, 252)
73+
note right of nodes: ansible-init
74+
nodes->>cloud: Query metadata
75+
cloud->>nodes: Metadata sent
76+
end
77+
rect rgb(204, 232, 252)
78+
note over ansible: $ ansible-playbook ansible/site.yml
79+
ansible->>nodes: Wait for ansible-init completion
80+
ansible->>nodes: Ansible tasks
81+
note over nodes: All services running
82+
end
83+
```
84+
85+
## Slurm Controlled Rebuild
86+
87+
This sequence applies to active clusters, after running the `site.yml` playbook
88+
for the first time. Slurm controlled rebuild requires that:
89+
- Compute groups in the OpenTofu `compute` variable have:
90+
- `ignore_image_changes: true`
91+
- `compute_init_enable: ['compute', ... ]`
92+
- The Ansible `rebuild` inventory group contains the `control` group.
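
As an illustration only (not part of this commit), the requirements above could be checked with a small Python sketch. The function name `rebuild_precondition_errors` and the input shapes (a dict of compute group definitions and a list of the `rebuild` group's members) are hypothetical, and the sketch assumes the requirements apply to every compute group passed in.

```python
from typing import Any, Mapping, Sequence


def rebuild_precondition_errors(
    compute_groups: Mapping[str, Mapping[str, Any]],
    rebuild_group_members: Sequence[str],
) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass.

    Hypothetical helper: the appliance itself does not ship this function.
    """
    errors: list[str] = []
    for name, group in compute_groups.items():
        # Each compute group must ignore image changes in OpenTofu.
        if group.get("ignore_image_changes") is not True:
            errors.append(f"compute group {name!r}: ignore_image_changes must be true")
        # compute-init must be enabled with at least the 'compute' feature.
        if "compute" not in group.get("compute_init_enable", []):
            errors.append(f"compute group {name!r}: compute_init_enable must include 'compute'")
    # The Ansible rebuild inventory group must contain the control group.
    if "control" not in rebuild_group_members:
        errors.append("inventory group 'rebuild' must contain the 'control' group")
    return errors
```

Calling it with a group that sets `ignore_image_changes` to true and lists `'compute'` in `compute_init_enable`, together with a `rebuild` group containing `control`, returns an empty list.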
93+
94+
TODO: should also document how compute-init does NOT run if the `site.yml`
95+
playbook has not been run.
96+
97+
```mermaid
98+
sequenceDiagram
99+
participant ansible as Ansible Deploy Host
100+
participant cloud as Cloud
101+
participant nodes as Cluster Instances
102+
note over ansible: Update OpenTofu cluster_image variable [1]
103+
rect rgb(204, 232, 250)
104+
    note over ansible: $ tofu apply ...
105+
ansible<<->>cloud: Check login/compute current vs desired images
106+
cloud->>nodes: Reimage login and control nodes
107+
ansible->>ansible: Update inventory/hosts.yml for<br>compute node image_id
108+
end
109+
rect rgb(204, 232, 250)
110+
note over ansible: $ ansible-playbook ansible/site.yml
111+
ansible->>nodes: Hostvars templated to nfs share
112+
ansible->>nodes: Ansible tasks
113+
    note over nodes: All services running
114+
end
115+
note over nodes: $ srun --reboot ...
116+
rect rgb(204, 232, 250)
117+
note over nodes: RebootProgram [2]
118+
nodes->>cloud: Compare current instance image to target from hostvars
119+
cloud->>nodes: Reimage if target != current
120+
rect rgb(252, 200, 100)
121+
note over nodes: compute-init [3]
122+
nodes->>nodes: Retrieve hostvars from nfs mount
123+
nodes->>nodes: Run ansible tasks
124+
note over nodes: Compute nodes rejoin cluster
125+
end
126+
end
127+
nodes->>nodes: srun task completes
128+
```
129+
Notes:
130+
1. And/or login/compute group overrides
131+
2. Running on control node
132+
3. On hosts targeted by job
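
As an illustration only (not part of this commit), the comparison step in the RebootProgram box, "Reimage if target != current", can be sketched in Python. The function name and the input dicts are hypothetical; only the `image_id` hostvar name comes from the diagram. What happens to nodes whose image already matches is not covered here.

```python
from typing import Mapping


def nodes_to_reimage(
    current_images: Mapping[str, str],
    hostvars: Mapping[str, Mapping[str, str]],
) -> list[str]:
    """Return the sorted names of nodes whose current image differs from the target.

    current_images maps node name -> image ID the instance is running now.
    hostvars maps node name -> its hostvars, where 'image_id' is the target.
    Nodes with no hostvars entry or no 'image_id' are skipped (an assumption).
    """
    selected = []
    for node, current in current_images.items():
        target = hostvars.get(node, {}).get("image_id")
        if target is not None and target != current:
            selected.append(node)
    return sorted(selected)
```

With `compute-0` running `img-a`, `compute-1` running `img-b`, and both targeting `img-b`, only `compute-0` is selected.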
133+
