Commit 6e3a808

Merge branch 'main' into fix/tuned-hpc-compute-hugemem
2 parents 36e5714 + 8e4d80c commit 6e3a808

42 files changed: 436 additions and 190 deletions.

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+name: Release images
+on:
+  workflow_dispatch:
+  release:
+    types:
+      - published # should work for both pre-releases and releases
+env:
+  IMAGE_PATH: environments/.stackhpc/tofu/cluster_image.auto.tfvars.json
+jobs:
+  ci-image-release:
+    name: ci-image-release
+    runs-on: ubuntu-22.04
+    concurrency: ${{ github.workflow }}-${{ github.ref }}
+    strategy:
+      fail-fast: false
+      matrix:
+        build:
+          - RL8
+          - RL9
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Write s3cmd configuration
+        run: echo "${{ secrets.ARCUS_S3_CFG }}" > ~/.s3cfg
+
+      - name: Install s3cmd
+        run: |
+          sudo apt-get update
+          sudo apt-get --yes install s3cmd
+
+      - name: Retrieve image name
+        run: |
+          TARGET_IMAGE=$(jq --arg version "${{ matrix.build }}" -r '.cluster_image[$version]' "${{ env.IMAGE_PATH }}")
+          echo "TARGET_IMAGE=${TARGET_IMAGE}" >> "$GITHUB_ENV"
+
+      - name: Copy image from pre-release to release bucket
+        run: s3cmd cp s3://openhpc-images-prerelease/${{ env.TARGET_IMAGE }} s3://openhpc-images
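
Reviewer note: the "Retrieve image name" and "Copy image" steps can be reasoned about offline with a rough Python sketch of the same lookup. This is only an illustration, not part of the commit. It assumes the tfvars JSON holds a `cluster_image` object keyed by build name, which is what the `jq` filter implies; the sample image names and the helper functions `target_image` and `copy_command` are hypothetical.

```python
import json

# Hypothetical stand-in for cluster_image.auto.tfvars.json.
# The image names are invented for illustration only.
SAMPLE_TFVARS = """
{
  "cluster_image": {
    "RL8": "example-image-RL8",
    "RL9": "example-image-RL9"
  }
}
"""


def target_image(tfvars_text: str, build: str) -> str:
    """Look up the image for one matrix build, like `.cluster_image[$version]`.

    Unlike jq, which prints "null" for a missing key, this raises KeyError.
    """
    data = json.loads(tfvars_text)
    return data["cluster_image"][build]


def copy_command(image: str) -> str:
    """Build the s3cmd command string used by the final workflow step."""
    return f"s3cmd cp s3://openhpc-images-prerelease/{image} s3://openhpc-images"


if __name__ == "__main__":
    # One command per entry in the workflow's build matrix.
    for build in ("RL8", "RL9"):
        print(copy_command(target_image(SAMPLE_TFVARS, build)))
```

Raising on a missing key is a deliberate difference from the workflow as written, where a build absent from the JSON would yield the literal string `null` as the image name.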

README.md

Lines changed: 0 additions & 2 deletions
@@ -25,8 +25,6 @@ The default configuration in this repository may be used to create a cluster to
 - Persistent state backed by an OpenStack volume.
 - NFS-based shared file system backed by another OpenStack volume.

-Note that the Open OnDemand portal and its remote apps are not usable with this default configuration.
-
 It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud.

 Before starting ensure that:

ansible/bootstrap.yml

Lines changed: 1 addition & 1 deletion
@@ -143,7 +143,7 @@
         - appliances_mode == 'configure'
         - not (dnf_repos_allow_insecure_creds | default(false)) # useful for development

-- hosts: cacerts:!builder
+- hosts: cacerts
   tags: cacerts
   gather_facts: false
   tasks:

ansible/roles/cacerts/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Configure CA certificates and trusts.

 ## Role variables

-- `ca-certificates`: Optional str. Path to directory containing certificates
+- `cacerts_cert_dir`: Optional str. Path to directory containing certificates
 in PEM or DER format. Any files here will be added to the list of CAs trusted
 by the system.

ansible/roles/cluster_infra/templates/outputs.tf.j2

Lines changed: 3 additions & 3 deletions
@@ -32,12 +32,12 @@ output "cluster_nodes" {
         }
       }
     ],
-    {% for partition in openhpc_slurm_partitions %}
+    {% for nodegroup in openhpc_nodegroups %}
     [
-      for compute in openstack_compute_instance_v2.{{ partition.name }}: {
+      for compute in openstack_compute_instance_v2.{{ nodegroup.name }}: {
         name = compute.name
         ip = compute.network[0].fixed_ip_v4
-        groups = ["compute", "{{ cluster_name }}_compute", "{{ cluster_name }}_{{ partition.name }}"],
+        groups = ["compute", "{{ cluster_name }}_compute", "{{ cluster_name }}_{{ nodegroup.name }}"],
         facts = {
           openstack_project_id = data.openstack_identity_auth_scope_v3.scope.project_id
         }

ansible/roles/cluster_infra/templates/resources.tf.j2

Lines changed: 16 additions & 16 deletions
@@ -282,11 +282,11 @@ resource "openstack_networking_port_v2" "control_storage" {
 ###
 # Workers
 ###
-{% for partition in openhpc_slurm_partitions %}
+{% for nodegroup in openhpc_nodegroups %}
 # Primary network
-resource "openstack_networking_port_v2" "{{ partition.name }}" {
-  count = {{ partition.count }}
-  name = "{{ cluster_name }}-compute-{{ partition.name }}-${count.index}"
+resource "openstack_networking_port_v2" "{{ nodegroup.name }}" {
+  count = {{ nodegroup.count }}
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-${count.index}"
   network_id = "${data.openstack_networking_network_v2.cluster_network.id}"
   admin_state_up = "true"

@@ -305,9 +305,9 @@ resource "openstack_networking_port_v2" "{{ partition.name }}" {

 # Storage network
 {% if cluster_storage_network is defined %}
-resource "openstack_networking_port_v2" "{{ partition.name }}_storage" {
-  count = {{ partition.count }}
-  name = "{{ cluster_name }}-compute-{{ partition.name }}-storage-${count.index}"
+resource "openstack_networking_port_v2" "{{ nodegroup.name }}_storage" {
+  count = {{ nodegroup.count }}
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-storage-${count.index}"
   network_id = data.openstack_networking_network_v2.cluster_storage.id
   admin_state_up = "true"

@@ -499,25 +499,25 @@ resource "openstack_compute_instance_v2" "control" {
   }
 }

-{% for partition in openhpc_slurm_partitions %}
-resource "openstack_compute_instance_v2" "{{ partition.name }}" {
-  count = {{ partition.count }}
+{% for nodegroup in openhpc_nodegroups %}
+resource "openstack_compute_instance_v2" "{{ nodegroup.name }}" {
+  count = {{ nodegroup.count }}

-  name = "{{ cluster_name }}-compute-{{ partition.name }}-${count.index}"
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-${count.index}"
   image_id = "{{ cluster_image }}"
-  {% if 'flavor_name' in partition %}
-  flavor_name = "{{ partition.flavor_name }}"
+  {% if 'flavor_name' in nodegroup %}
+  flavor_name = "{{ nodegroup.flavor_name }}"
   {% else %}
-  flavor_id = "{{ partition.flavor }}"
+  flavor_id = "{{ nodegroup.flavor }}"
   {% endif %}

   network {
-    port = openstack_networking_port_v2.{{ partition.name }}[count.index].id
+    port = openstack_networking_port_v2.{{ nodegroup.name }}[count.index].id
   }

   {% if cluster_storage_network is defined %}
   network {
-    port = openstack_networking_port_v2.{{ partition.name }}_storage[count.index].id
+    port = openstack_networking_port_v2.{{ nodegroup.name }}_storage[count.index].id
   }
   {% endif %}

ansible/roles/cuda/README.md

Lines changed: 1 addition & 1 deletion
@@ -10,6 +10,6 @@ Requires OFED to be installed to provide required kernel-* packages.

 - `cuda_repo_url`: Optional. URL of `.repo` file. Default is upstream for appropriate OS/architecture.
 - `cuda_nvidia_driver_stream`: Optional. Version of `nvidia-driver` stream to enable. This controls whether the open or proprietary drivers are installed and the major version. Changing this once the drivers are installed does not change the version.
-- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-8']`.
+- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-9']`.
 - `cuda_package_version`: Optional. Default `latest` which will install the latest packages if not installed but won't upgrade already-installed packages. Use `'none'` to skip installing CUDA.
 - `cuda_persistenced_state`: Optional. State of systemd `nvidia-persistenced` service. Values as [ansible.builtin.systemd:state](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html#parameter-state). Default `started`.

ansible/roles/cuda/defaults/main.yml

Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
 cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos/rhel{{ ansible_distribution_major_version }}/{{ ansible_architecture }}/cuda-rhel{{ ansible_distribution_major_version }}.repo"
-cuda_nvidia_driver_stream: '570-open'
-cuda_package_version: '12.8.1-1'
-cuda_version_short: '12.8'
+cuda_nvidia_driver_stream: '575-open'
+cuda_package_version: '12.9.0-1'
+cuda_version_short: '12.9'
 cuda_packages:
   - "cuda{{ ('-' + cuda_package_version) if cuda_package_version != 'latest' else '' }}"
   - nvidia-gds
   - cmake
-  - cuda-toolkit-12-8
+  - cuda-toolkit-12-9
 cuda_samples_release_url: "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v{{ cuda_version_short }}.tar.gz"
 cuda_samples_path: "/var/lib/{{ ansible_user }}/cuda_samples"
 cuda_samples_programs:

ansible/roles/cuda/tasks/samples.yml

Lines changed: 0 additions & 33 deletions
@@ -25,36 +25,3 @@
     cmd: . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
     chdir: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build"
     creates: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-
-- name: Run CUDA deviceQuery
-  command:
-    cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-  register: _cuda_devicequery
-
-- name: Set fact for CUDA devices
-  set_fact:
-    cuda_devices: "{{ _cuda_devicequery.stdout | regex_findall('Device (\\d+):') }}"
-
-- name: Run CUDA bandwidth test
-  command:
-    cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/bandwidthTest/bandwidthTest --device={{ item }}"
-  register: _cuda_bandwidthtest
-  loop: "{{ cuda_devices }}"
-  loop_control:
-    label: "Device {{ item }}" # e.g '0'
-
-- name: Summarise bandwidth test output
-  debug:
-    msg: |
-      {{ _parts[1].splitlines()[0] | trim }}
-      Bandwidths: (Gb/s)
-      Host to Device: {{ _parts[2].split()[-1] }}
-      Device to Host: {{ _parts[3].split()[-1] }}
-      Device to Device: {{ _parts[4].split()[-1] }}
-      {{ ': '.join(_parts[5].split('=') | map('trim')) }}
-      {{ _parts[6] }}
-  loop: "{{ _cuda_bandwidthtest.results }}"
-  vars:
-    _parts: "{{ item.stdout.split('\n\n') }}"
-  loop_control:
-    label: "Device {{ item.item }}" # e.g '0'

ansible/roles/lustre/README.md

Lines changed: 6 additions & 3 deletions
@@ -7,8 +7,7 @@ Install and configure a Lustre client. This builds RPM packages from source.
 **NB:** Currently this only supports RockyLinux 9.

 ## Role Variables
-
-- `lustre_version`: Optional str. Version of lustre to build, default `2.15.6` which is the first version with EL9.5 support
+The following variables control configuration of Lustre clients.
 - `lustre_lnet_label`: Optional str. The "lnet label" part of the host's NID, e.g. `tcp0`. Only the `tcp` protocol type is currently supported. Default `tcp`.
 - `lustre_mgs_nid`: Required str. The NID(s) for the MGS, e.g. `192.168.227.11@tcp1` (separate mutiple MGS NIDs using `:`).
 - `lustre_mounts`: Required list. Define Lustre filesystems and mountpoints as a list of dicts with keys:
@@ -19,7 +18,11 @@ Install and configure a Lustre client. This builds RPM packages from source.
 - `lustre_mount_state`. Optional default mount state for all mounts, as for [ansible.posix.mount](https://docs.ansible.com/ansible/latest/collections/ansible/posix/mount_module.html#parameter-state). Default is `mounted`.
 - `lustre_mount_options`. Optional default mount options. Default values are systemd defaults from [Lustre client docs](http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes).

-The following variables control the package build and and install and should not generally be required:
+The following variables control the package build and and install:
+- `lustre_version`: Optional str. Version of lustre to build, default `2.15.6/lu-18085`
+which is the first version with EL9.5 support, plus a fix for https://jira.whamcloud.com/browse/LU-18085.
+- `lustre_repo`: Optional str. URL for Lustre repo. Default is a StackHPC repo
+incorporating the above fix.
 - `lustre_build_packages`: Optional list. Prerequisite packages required to build Lustre. See `defaults/main.yml`.
 - `lustre_build_dir`: Optional str. Path to build lustre at, default `/tmp/lustre-release`.
 - `lustre_configure_opts`: Optional list. Options to `./configure` command. Default builds client rpms supporting Mellanox OFED, without support for GSS keys.
