Commit b11696e

Improve build group definitions (#788)
* support raid root disks in stackhpc-built images
* clarify image requirements
* bump CI image
* remove default build groups
* fixup doca/cuda inventory groups
* add fatimage inventory group
* update docs for image build
* minor docs tweaks
* fixup fatimage group definition
* fix build groups
* bump CI image
* minor docs tweak
* fix linter markdown error
* fix linter markdown error
* swap example site image build to normal case
* fix borked merge
* fixes after self-review
* bump CI image
1 parent 82c814e commit b11696e

7 files changed: +195 -96 lines changed

.github/workflows/fatimage.yml

Lines changed: 2 additions & 2 deletions
@@ -36,10 +36,10 @@ jobs:
 build:
 - image_name: openhpc-RL8
 source_image_name: Rocky-8-GenericCloud-Base-8.10-20240528.0.x86_64.raw
-inventory_groups: control,compute,login,update
+inventory_groups: fatimage
 - image_name: openhpc-RL9
 source_image_name: Rocky-9-GenericCloud-Base-9.6-20250531.0.x86_64.qcow2
-inventory_groups: control,compute,login,update
+inventory_groups: fatimage
 env:
 ANSIBLE_FORCE_COLOR: True
 OS_CLOUD: openstack

docs/image-build.md

Lines changed: 117 additions & 48 deletions
@@ -1,67 +1,136 @@
 # Packer-based image build
 
-The appliance contains code and configuration to use [Packer](https://developer.hashicorp.com/packer) with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.
-
-The Packer configuration defined here builds "fat images" which contain packages, binaries and container images but no cluster-specific configuration. Using these:
-
-- Enables the image to be tested in CI before production use.
-- Ensures re-deployment of the cluster or deployment of additional nodes can be completed even if packages are changed in upstream repositories (e.g. due to RockyLinux or OpenHPC updates).
-- Improves deployment speed by reducing the number of package downloads to improve deployment speed.
-
-The fat images StackHPC builds and tests in CI are available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases). However with some additional configuration it is also possible to:
-
-1. Build site-specific fat images from scratch.
-2. Extend an existing fat image with additional functionality.
+The appliance contains configuration to use [Packer](https://developer.hashicorp.com/packer)
+with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack)
+to build images. Using images:
+
+- Enables the image to be tested in a `staging` environment before deployment
+to the `production` environment.
+- Ensures re-deployment of the cluster or deployment of additional nodes is
+repeatable.
+- Improves deployment speed by reducing the number of package installations.
+
+The Packer configuration here can be used to build two types of images:
+
+1. "Fat images" which contain packages, binaries and container images but no
+cluster-specific configuration. These start from a RockyLinux GenericCloud
+(or compatible) image. The fat images StackHPC builds and tests in CI are
+available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases).
+However site-specific fat images can also be built from a different source
+image e.g. if a different partition layout is required.
+2. "Extra-build" images which extend a fat image to create a site-specific
+image with additional packages or functionality. For example the NVIDIA
+`cuda` packages cannot be redistributed hence require an "extra" build.
 
 ## Usage
 
-To build either a site-specific fat image from scratch, or to extend an existing StackHPC fat image:
-
-1. Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
-2. The provided dev credentials for StackHPC's "Ark" Pulp server must be added to the target environments. This is done by overriding `dnf_repos_username` and `dnf_repos_password` with your vault encrypted credentials in `environments/<base_environment>/inventory/group_vars/all/pulp.yml`. See the [experimental docs](experimental/pulp.md) if you wish instead wish to use a local Pulp server.
-3. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables) at e.g. `environments/<environment>/builder.pkrvars.hcl` containing at a minimum:
-
-```hcl
-flavor = "general.v1.small" # VM flavor to use for builder VMs
-networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
-source_image_name = "Rocky-9-GenericCloud-Base-9.4" # Name of image to create VM with, i.e. starting image
-inventory_groups = "control,login,compute" # Additional inventory groups to add build VM to
-```
-
-Note that:
-
-- The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).
-- The flavor used must have sufficent memory for the build tasks, but otherwise does not need to match the final cluster nodes. Usually 8GB is sufficent. By default, the build VM is volume-backed to allow control of the root disk size (and hence final image size) so the flavor disk size does not matter.
-- The source image should be either a RockyLinux GenericCloud image for a site-specific image build from scratch, or a StackHPC fat image if extending an existing image.
-- The `inventory_groups` variable takes a comma-separated list of Ansible inventory groups to add the build VM to. This is in addition to the `builder` group which it is always added to. This controls which Ansible roles and functionality run during build, and hence what gets added to the image.
-All possible groups are listed in `environments/common/groups` but common options for this variable will be:
-- `update,control,login,compute`: The resultant image has all packages in the source image updated, and then packages for all types of nodes in the cluster are added. When using a GenericCloud image for `source_image_name` this builds a site-specific fat image from scratch.
-- One or more specific groups which are not enabled in the appliance by default, e.g. `lustre`. When using a StackHPC fat image for `source_image_name` this extends the image with just this additional functionality.
+For either a site-specific fat-image build or an extra-build:
+
+1. Ensure the current OpenStack credentials have sufficient authorisation to
+upload images (this may or may not require the `member` role for an
+application credential, depending on your OpenStack configuration).
+2. If package installs are required, add the provided dev credentials for
+StackHPC's "Ark" Pulp server to the `site` environment:
+
+```yaml
+# environments/site/inventory/group_vars/all/dnf_repos.yml:
+dnf_repos_username: your-ark-username
+dnf_repos_password: "{{ vault_dnf_repos_password }}"
+```
+
+```yaml
+# environments/site/inventory/group_vars/all/dnf_repos.yml:
+dnf_repos_password: "your-ark-password"
+```
+
+> [!IMPORTANT]
+> The latter file should be vault-encrypted.
+
+Alternatively, configure a [local Pulp mirror](experimental/pulp.md).
+
+3. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables). It must specify at least the
+following variables:
+
+```hcl
+# environments/site/builder.pkrvars.hcl:
+flavor = "general.v1.small" # VM flavor to use for builder VMs
+networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
+source_image_name = "Rocky-9-GenericCloud-Base-9.4" # Name of image to create VM with, i.e. starting image
+inventory_groups = "doca,cuda,extra_packages" # Build VM inventory groups => functionality to add to image
+```
+
+See the top of [packer/openstack.pkr.hcl](../packer/openstack.pkr.hcl)
+for all possible variables which can be set.
+
+Note that:
+
+- Normally the network must provide outbound internet access. However it
+does not need to provide access to resources used by the actual cluster
+nodes (e.g. Slurm control node, network filesystem servers etc.).
+- The flavor used must have sufficient memory for the build tasks (usually
+8GB), but otherwise does not need to match the actual cluster node
+flavor(s).
+- By default, the build VM is volume-backed to allow control of the root
+disk size (and hence final image size), so the flavor's disk size does not
+matter. The default volume size is not sufficient if enabling `cuda` and/or
+`doca` and should be increased:
+```terraform
+volume_size = 35 # GB
+```
+- The source image should be either:
+- For a site-specific fatimage build: A RockyLinux GenericCloud or
+compatible image.
+- For an extra-build image: Usually the appropriate StackHPC fat image,
+as defined in `environments/.stackhpc/tofu/cluster_image.auto.tfvars.json` at the
+checkout's current commit. See the [GitHub release page](https://github.com/stackhpc/ansible-slurm-appliance/releases)
+for download links. In some cases extra builds may be chained, e.g.
+one extra build adds a Lustre client, and the resulting image is used
+as the source image for an extra build adding GPU support.
+- The `inventory_groups` variable takes a comma-separated list of Ansible
+inventory groups to add the build VM to (in addition to the `builder`
+group which it is always in). This controls which Ansible roles and
+functionality run during build, and hence what gets added to the image.
+All possible groups are listed in `environments/common/groups` but common
+options for this variable will be:
+
+- For a fatimage build: `fatimage`: This is defined in `environments/site/inventory/groups`
+and results in an update of all packages in the source image, plus
+installation of packages for default control, login and compute nodes.
+
+- For an extra-built image, one or more specific groups. This extends the
+source image with just this additional functionality. The example above
+installs NVIDIA DOCA network drivers, NVIDIA GPU drivers/Cuda packages
+and also enables installation of packages defined in the
+`appliances_extra_packages_other` variable (see
+[docs/operations.md](./operations.md#adding-additional-packages)).
 
 4. Activate the venv and the relevant environment.
 
 5. Build images using the relevant variable definition file, e.g.:
 
-```shell
-cd packer/
-PACKER_LOG=1 /usr/bin/packer build -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
-```
+```shell
+cd packer/
+PACKER_LOG=1 /usr/bin/packer build -on-error=ask -var-file=../environments/site/builder.pkrvars.hcl openstack.pkr.hcl
+```
 
-**NB:** If the build fails while creating the volume, check if the source image has the `signature_verified` property:
+**NB:** If the build fails while creating the volume, check if the source image has the `signature_verified` property:
 
-```shell
-openstack image show $SOURCE_IMAGE
-```
+```shell
+openstack image show $SOURCE_IMAGE
+```
 
-If it does, remove this property:
+If it does, remove this property:
 
-```shell
-openstack image unset --property signature_verified $SOURCE_IMAGE
-```
+```shell
+openstack image unset --property signature_verified $SOURCE_IMAGE
+```
 
-then delete the failed volume, select cancelling the build when Packer queries, and then retry. This is [OpenStack bug 1823445](https://bugs.launchpad.net/cinder/+bug/1823445).
+then delete the failed volume, select cancelling the build when Packer asks,
+and then retry. This is [OpenStack bug 1823445](https://bugs.launchpad.net/cinder/+bug/1823445).
 
-6. The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc` and including a timestamp and a shortened Git hash.
+6. The built image will be automatically uploaded to OpenStack. By default it
+will have a name prefixed `openhpc` and including a timestamp and a shortened
+Git hash.
 
 ## Build Process
 

docs/operations.md

Lines changed: 23 additions & 20 deletions
@@ -83,7 +83,7 @@ disabled during runtime to prevent Ark credentials from being leaked. To enable
 
 In both cases, Ark credentials will be required.
 
-=# Adding Additional Packages
+## Adding Additional Packages
 
 By default, the following utility packages are installed during the StackHPC image build:
 
@@ -101,22 +101,27 @@ By default, the following utility packages are installed during the StackHPC ima
 
 Additional packages can be added during image builds by:
 
-- adding the `extra_packages` group to the build `inventory_groups` (see
-[docs/image-build.md](./image-build.md))
-- defining a list of packages in `appliances_extra_packages_other` in e.g.
-`environments/$SITE_ENV/inventory/group_vars/all/defaults.yml`. For example:
+1. Configuring an [image build](./image-build.md) to enable the
+`extra_packages` group:
 
-```yaml
-# environments/foo-base/inventory/group_vars/all/defaults.yml:
-appliances_extra_packages_other:
-- somepackage
-- anotherpackage
-```
+```terraform
+# environments/site/builder.pkrvars.hcl:
+...
+inventory_groups = "extra_packages"
+...
+```
+
+2. Defining a list of packages in `appliances_extra_packages_other`, for example:
 
-For packages which come from repositories mirrored by StackHPC's "Ark" Pulp server
-(including rocky, EPEL and OpenHPC repositories), this will require either [Ark
-credentials](./image-build.md)) or a [local Pulp mirror](./experimental/pulp.md)
-to be configured. This includes rocky, EPEL and OpenHPC repos.
+```yaml
+# environments/site/inventory/group_vars/all/defaults.yml:
+appliances_extra_packages_other:
+- somepackage
+- anotherpackage
+```
+
+3. Either adding [Ark credentials](./image-build.md) or a [local Pulp mirror](./experimental/pulp.md)
+to provide access to the required [repository snapshots](../environments/common/inventory/group_vars/all/dnf_repo_timestamps.yml).
 
 The packages available from the OpenHPC repos are described in Appendix E of
 the OpenHPC installation guide (linked from the
@@ -125,9 +130,9 @@ the OpenHPC installation guide (linked from the
 corresponding `lmod` modules.
 
 Packages _may_ also be installed during the site.yml, by adding the `cluster`
-group into the `extra_packages` group. An error will occur if Ark credentials
-are defined in this case, as they are readable by unprivileged users in the
-`.repo` files and a local Pulp mirror must be used instead.
+group as a child of the `extra_packages` group. An error will occur if Ark
+credentials are defined in this case, as they are readable by unprivileged users
+in the `.repo` files and a local Pulp mirror must be used instead.
 
 If additional repositories are required, these could be added/enabled as necessary in a play added to `environments/$SITE_ENV/hooks/{pre,post}.yml` as appropriate.
 Note such a play should NOT exclude the builder group, so that the repositories are also added to built images.
@@ -148,8 +153,6 @@ ansible-playbook environments/$SITE_ENV/hooks/{pre,post}.yml
 
 as appropriate.
 
-TODO: improve description about adding these to extra images.
-
 ## Reconfiguring Slurm
 
 At a minimum run:
Lines changed: 3 additions & 17 deletions
@@ -1,3 +1,5 @@
+# Unless noted otherwise features enabled here are tested by CI site.yml playbook
+
 [basic_users:children]
 cluster
 
@@ -20,28 +22,16 @@ cluster
 # --- end of FreeIPA example ---
 
 [manila:children]
-# Allows demo; also installs manila client in fat image
+# Not actually tested but allows demo using this environment
 login
 compute
 
 [chrony:children]
 cluster
 
 [tuned:children]
-# Install tuned into fat image
-# NB: builder has tuned_enabled and tuned_started false so does not configure it
-builder
-# Also test tuned during site playbook
 cluster
 
-[squid:children]
-# Install squid into fat image
-builder
-
-[sssd:children]
-# Install sssd into fat image
-builder
-
 [rebuild:children]
 control
 
@@ -50,7 +40,3 @@ cluster
 
 [compute_init:children]
 compute
-
-[raid:children]
-# Configure fatimage for raid
-builder
Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
 "cluster_image": {
-"RL8": "openhpc-RL8-251001-1515-81a25814",
-"RL9": "openhpc-RL9-251001-1424-81a25814"
+"RL8": "openhpc-RL8-251002-1537-1d21952c",
+"RL9": "openhpc-RL9-251002-1456-1d21952c"
 }
 }

environments/common/inventory/groups

Lines changed: 18 additions & 2 deletions
@@ -1,3 +1,12 @@
+# This file
+# 1. Ensures all groups in the appliance are always defined - even if empty
+# 2. Defines dependencies between groups - child groups require & enable parent
+#
+# IMPORTANT
+# ---------
+# All groups and child groups here MUST be empty, as other environments cannot
+# remove hosts/groups.
+
 [login]
 # All Slurm login nodes. Combined control/login nodes are not supported.
 
@@ -129,6 +138,9 @@ prometheus
 freeipa_server
 freeipa_client
 
+[doca]
+# Add `builder` to install NVIDIA DOCA during image build
+
 [cuda]
 # Hosts to install NVIDIA CUDA on - see ansible/roles/cuda/README.md
 
@@ -193,9 +205,10 @@ k3s_agent
 
 [dnf_repos:children]
 # Hosts to replace system repos with Pulp repos
-# Warning: when using Ark directly rather than a local Pulp server, adding hosts other than `builder` will leak Ark creds to users
-builder
+# Roles/groups listed here *always* do installs:
 extra_packages
+doca
+# TODO: can't express: if cuda and builder, enable dnf_repos
 
 [pulp_site]
 # Add builder to this group to automatically sync pulp during image build
@@ -220,3 +233,6 @@ extra_packages
 
 [raid]
 # Add `builder` to configure image for software raid
+
+[fatimage]
+# Add build VM into this group to enable all features with this as child
0 commit comments

Comments
 (0)