Commit 5c8de5a

update docs for image build
1 parent 57d352b commit 5c8de5a

2 files changed

+96
-33
lines changed

docs/image-build.md

Lines changed: 80 additions & 22 deletions
@@ -1,47 +1,103 @@
11
# Packer-based image build
22

3-
The appliance contains code and configuration to use [Packer](https://developer.hashicorp.com/packer) with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.
3+
The appliance contains configuration to use [Packer](https://developer.hashicorp.com/packer)
4+
with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack)
5+
to build images. Using images:
6+
- Enables the image to be tested in a `staging` environment before deployment
7+
to the `production` environment.
8+
- Ensures re-deployment of the cluster or deployment of additional nodes is
9+
repeatable.
10+
- Improves deployment speed by reducing the number of package installations.
11+
12+
The Packer configuration here can be used to build two types of images:
13+
1. "Fat images" which contain packages, binaries and container images but no
14+
cluster-specific configuration. These start from a RockyLinux GenericCloud
15+
(or compatible) image. The fat images StackHPC builds and tests in CI are
16+
available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases).
17+
However site-specific fat images can also be built from a different source
18+
image e.g. if a different partition layout is required.
19+
2. "Extra-build" images which extend a StackHPC fat image to create a site-specific
20+
image with additional packages or functionality. For example the NVIDIA
21+
`cuda` packages cannot be redistributed hence require an "extra" build.
422

5-
The Packer configuration defined here builds "fat images" which contain packages, binaries and container images but no cluster-specific configuration. Using these:
6-
- Enables the image to be tested in CI before production use.
7-
- Ensures re-deployment of the cluster or deployment of additional nodes can be completed even if packages are changed in upstream repositories (e.g. due to RockyLinux or OpenHPC updates).
8-
- Improves deployment speed by reducing the number of package downloads to improve deployment speed.
23+
# Usage
924

10-
The fat images StackHPC builds and tests in CI are available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases). However with some additional configuration it is also possible to:
11-
1. Build site-specific fat images from scratch.
12-
2. Extend an existing fat image with additional functionality.
25+
For either a site-specific fat-image build or an extra-build:
1326

27+
1. Ensure the current OpenStack credentials have sufficient authorisation to
28+
upload images (this may or may not require the `member` role for an
29+
application credential, depending on your OpenStack configuration).
30+
2. If package installs are required, add the provided dev credentials for
31+
StackHPC's "Ark" Pulp server to the `site` environment:
1432

15-
# Usage
33+
```yaml
34+
# environments/site/inventory/group_vars/all/dnf_repos.yml:
35+
dnf_repos_username: your-ark-username
36+
dnf_repos_password: "{{ vault_dnf_repos_password }}"
37+
```
38+
```yaml
39+
# environments/site/inventory/group_vars/all/dnf_repos.yml:
40+
dnf_repos_password: 'your-ark-password'
41+
```
42+
> [!IMPORTANT]
43+
> The latter file should be vault-encrypted.
1644
17-
To build either a site-specific fat image from scratch, or to extend an existing StackHPC fat image:
45+
Alternatively, configure a [local Pulp mirror](experimental/pulp.md).
1846
19-
1. Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
20-
2. The provided dev credentials for StackHPC's "Ark" Pulp server must be added to the target environments. This is done by overriding `dnf_repos_username` and `dnf_repos_password` with your vault encrypted credentials in `environments/<base_environment>/inventory/group_vars/all/pulp.yml`. See the [experimental docs](experimental/pulp.md) if you instead wish to use a local Pulp server.
21-
3. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables) at e.g. `environments/<environment>/builder.pkrvars.hcl` containing at a minimum:
47+
3. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables) containing at a minimum e.g.:
2248
2349
```hcl
50+
# environments/site/builder.pkrvars.hcl:
2451
flavor = "general.v1.small" # VM flavor to use for builder VMs
2552
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
2653
source_image_name = "Rocky-9-GenericCloud-Base-9.4" # Name of image to create VM with, i.e. starting image
27-
inventory_groups = "control,login,compute" # Additional inventory groups to add build VM to
54+
inventory_groups = "cuda" # Additional inventory groups to add build VM to
2855

2956
```
3057

3158
Note that:
32-
- The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).
33-
- The flavor used must have sufficient memory for the build tasks, but otherwise does not need to match the final cluster nodes. Usually 8GB is sufficient. By default, the build VM is volume-backed to allow control of the root disk size (and hence final image size) so the flavor disk size does not matter.
34-
- The source image should be either a RockyLinux GenericCloud image for a site-specific image build from scratch, or a StackHPC fat image if extending an existing image.
35-
- The `inventory_groups` variable takes a comma-separated list of Ansible inventory groups to add the build VM to. This is in addition to the `builder` group which it is always added to. This controls which Ansible roles and functionality run during build, and hence what gets added to the image. All possible groups are listed in `environments/common/groups` but common options for this variable will be:
36-
- `update,control,login,compute`: The resultant image has all packages in the source image updated, and then packages for all types of nodes in the cluster are added. When using a GenericCloud image for `source_image_name` this builds a site-specific fat image from scratch.
37-
- One or more specific groups which are not enabled in the appliance by default, e.g. `lustre`. When using a StackHPC fat image for `source_image_name` this extends the image with just this additional functionality.
59+
- Normally the network must provide outbound internet access. However it
60+
does not need to provide access to resources used by the actual cluster
61+
nodes (e.g. Slurm control node, network filesystem servers etc.).
62+
- The flavor used must have sufficient memory for the build tasks (usually
63+
8GB), but otherwise does not need to match the actual cluster node
64+
flavor(s).
65+
- By default, the build VM is volume-backed to allow control of the root
66+
disk size (and hence final image size), so the flavor's disk size does not
67+
matter. The default volume size is not sufficient if enabling `cuda` and/or
68+
`doca` and should be increased:
69+
```terraform
70+
volume_size = 35 # GB
71+
```
72+
- The source image should be either:
73+
- For a site-specific fat-image build: A RockyLinux GenericCloud or
74+
compatible image.
75+
- For an extra-build image: The appropriate StackHPC fat image, as defined
76+
in `environments/.stackhpc/tofu/cluster_image.auto.tfvars.json`. See the
77+
[GitHub release page](https://github.com/stackhpc/ansible-slurm-appliance/releases)
78+
for download links.
79+
- The `inventory_groups` variable takes a comma-separated list of Ansible
80+
inventory groups to add the build VM to (in addition to the `builder`
81+
group which it is always in). This controls which Ansible roles and
82+
functionality run during build, and hence what gets added to the image.
83+
All possible groups are listed in `environments/common/groups` but common
84+
options for this variable will be:
85+
- For a fat-image build: `fatimage`: This is defined in `environments/{common,site}/inventory/groups`
86+
and results in an update of all packages in the source image, plus
87+
installation of packages for default control, login and compute nodes.
88+
- For an extra-build image, one or more specific groups e.g. `cuda` or
89+
`doca,lustre`. This extends the source image with just this additional
90+
functionality.
91+
92+
See the top of [packer/openstack.pkr.hcl](../packer/openstack.pkr.hcl)
93+
for all possible variables which can be set.
3894

3995
4. Activate the venv and the relevant environment.
4096

4197
5. Build images using the relevant variable definition file, e.g.:
4298

4399
cd packer/
44-
PACKER_LOG=1 /usr/bin/packer build -on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
100+
PACKER_LOG=1 /usr/bin/packer build -on-error=ask -var-file=../environments/site/builder.pkrvars.hcl openstack.pkr.hcl
45101

46102
**NB:** If the build fails while creating the volume, check if the source image has the `signature_verified` property:
47103

@@ -53,7 +109,9 @@ To build either a site-specific fat image from scratch, or to extend an existing
53109

54110
then delete the failed volume, choose to cancel the build when Packer prompts, and then retry. This is [OpenStack bug 1823445](https://bugs.launchpad.net/cinder/+bug/1823445).
55111

56-
6. The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc` and including a timestamp and a shortened git hash.
112+
6. The built image will be automatically uploaded to OpenStack. By default it
113+
will have a name prefixed `openhpc` and including a timestamp and a shortened
114+
git hash.
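
Steps 3 to 5 above could be wrapped in a small pre-flight check. The following is a minimal sketch and not part of the appliance: `check_pkrvars` and `build_image` are hypothetical helper names, and the list of required variables is simply the four shown in the example variable definition file. It assumes it is run from the repository root with the venv and the relevant environment already activated.

```shell
#!/usr/bin/env bash
# Sketch only: check_pkrvars and build_image are hypothetical helpers,
# not part of the appliance.

# Print any of the minimum variables missing from a variable definition file.
# Returns non-zero if at least one is missing.
check_pkrvars() {
  local file="$1" missing=0 var
  for var in flavor networks source_image_name inventory_groups; do
    # Look for an assignment such as `flavor = "..."` at the start of a line.
    if ! grep -Eq "^[[:space:]]*${var}[[:space:]]*=" "$file"; then
      echo "missing: ${var}"
      missing=1
    fi
  done
  return "$missing"
}

# Run the build command from step 5 only if the variable file passes the check.
build_image() {
  local var_file="$1" abs_var_file
  check_pkrvars "$var_file" || return 1
  # Resolve to an absolute path because the build runs from packer/.
  abs_var_file="$(cd "$(dirname "$var_file")" && pwd)/$(basename "$var_file")"
  (cd packer/ && PACKER_LOG=1 /usr/bin/packer build -on-error=ask \
      -var-file="$abs_var_file" openstack.pkr.hcl)
}
```

For example, `build_image environments/site/builder.pkrvars.hcl` would report any missing variables and only start the Packer build if all four are present.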
57115

58116
# Build Process
59117

docs/operations.md

Lines changed: 16 additions & 11 deletions
@@ -95,22 +95,29 @@ By default, the following utility packages are installed during the StackHPC ima
9595
- s-nail
9696

9797
Additional packages can be added during image builds by:
98-
- adding the `extra_packages` group to the build `inventory_groups` (see
99-
[docs/image-build.md](./image-build.md))
100-
- defining a list of packages in `appliances_extra_packages_other` in e.g.
101-
`environments/$SITE_ENV/inventory/group_vars/all/defaults.yml`. For example:
98+
99+
1. Configuring an [image build](./image-build.md) to enable the
100+
`extra_packages` group:
101+
102+
103+
```terraform
104+
# environments/site/builder.pkrvars.hcl:
105+
...
106+
inventory_groups = "extra_packages"
107+
...
108+
```
109+
110+
2. Defining a list of packages in `appliances_extra_packages_other`, for example:
102111
103112
```yaml
104-
# environments/foo-base/inventory/group_vars/all/defaults.yml:
113+
# environments/site/inventory/group_vars/all/defaults.yml:
105114
appliances_extra_packages_other:
106115
- somepackage
107116
- anotherpackage
108117
```
109118
110-
For packages which come from repositories mirrored by StackHPC's "Ark" Pulp server
111-
(including rocky, EPEL and OpenHPC repositories), this will require either [Ark
112-
credentials](./image-build.md)) or a [local Pulp mirror](./experimental/pulp.md)
113-
to be configured. This includes rocky, EPEL and OpenHPC repos.
119+
3. Either adding [Ark credentials](./image-build.md) or a [local Pulp mirror](./experimental/pulp.md)
120+
to provide access to the required [repository snapshots](../environments/common/inventory/group_vars/all/dnf_repo_timestamps.yml).
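
Because `inventory_groups` takes a comma-separated list, `extra_packages` can presumably be combined with other groups in a single build. The sketch below shows one way to compose that value; `join_groups` and `inventory_groups_line` are hypothetical helpers, and pairing `cuda` with `extra_packages` is an illustrative assumption rather than a tested combination.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical, not part of the appliance.

# Join group names into the comma-separated form used by `inventory_groups`.
join_groups() {
  local IFS=','
  echo "$*"
}

# Emit the corresponding line for a Packer variable definition file.
inventory_groups_line() {
  printf 'inventory_groups = "%s"\n' "$(join_groups "$@")"
}

inventory_groups_line cuda extra_packages
# prints: inventory_groups = "cuda,extra_packages"
```

The printed line could then replace the `inventory_groups` entry in e.g. `environments/site/builder.pkrvars.hcl`.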
114121
115122
The packages available from the OpenHPC repos are described in Appendix E of
116123
the OpenHPC installation guide (linked from the
@@ -138,8 +145,6 @@ Adding these repos/packages to the cluster/image would then require running:
138145
139146
as appropriate.
140147
141-
TODO: improve description about adding these to extra images.
142-
143148
144149
# Reconfiguring Slurm
145150
