
Commit ff38509

Merge branch 'main' into feat/tf-nodegroup-typedefs
2 parents: 8fc88da + dfdebdb

33 files changed: +230 -166 lines changed

ansible/roles/cluster_infra/templates/outputs.tf.j2

Lines changed: 3 additions & 3 deletions
@@ -32,12 +32,12 @@ output "cluster_nodes" {
       }
     }
   ],
-  {% for partition in openhpc_slurm_partitions %}
+  {% for nodegroup in openhpc_nodegroups %}
   [
-    for compute in openstack_compute_instance_v2.{{ partition.name }}: {
+    for compute in openstack_compute_instance_v2.{{ nodegroup.name }}: {
       name = compute.name
      ip = compute.network[0].fixed_ip_v4
-     groups = ["compute", "{{ cluster_name }}_compute", "{{ cluster_name }}_{{ partition.name }}"],
+     groups = ["compute", "{{ cluster_name }}_compute", "{{ cluster_name }}_{{ nodegroup.name }}"],
      facts = {
        openstack_project_id = data.openstack_identity_auth_scope_v3.scope.project_id
      }
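
For context, the loop above only relies on each `openhpc_nodegroups` entry having a `name`; each resulting compute instance is then placed in the `compute`, `<cluster_name>_compute` and `<cluster_name>_<nodegroup name>` inventory groups. A minimal sketch of the shape this template assumes (names are hypothetical, and only the key read by this template is shown):

```yaml
# Hypothetical values - only the key read by outputs.tf.j2 above is shown
openhpc_nodegroups:
  - name: general   # instances grouped as: compute, <cluster_name>_compute, <cluster_name>_general
  - name: gpu       # instances grouped as: compute, <cluster_name>_compute, <cluster_name>_gpu
```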

ansible/roles/cluster_infra/templates/resources.tf.j2

Lines changed: 16 additions & 16 deletions
@@ -282,11 +282,11 @@ resource "openstack_networking_port_v2" "control_storage" {
 ###
 # Workers
 ###
-{% for partition in openhpc_slurm_partitions %}
+{% for nodegroup in openhpc_nodegroups %}
 # Primary network
-resource "openstack_networking_port_v2" "{{ partition.name }}" {
-  count = {{ partition.count }}
-  name = "{{ cluster_name }}-compute-{{ partition.name }}-${count.index}"
+resource "openstack_networking_port_v2" "{{ nodegroup.name }}" {
+  count = {{ nodegroup.count }}
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-${count.index}"
   network_id = "${data.openstack_networking_network_v2.cluster_network.id}"
   admin_state_up = "true"

@@ -305,9 +305,9 @@ resource "openstack_networking_port_v2" "{{ partition.name }}" {

 # Storage network
 {% if cluster_storage_network is defined %}
-resource "openstack_networking_port_v2" "{{ partition.name }}_storage" {
-  count = {{ partition.count }}
-  name = "{{ cluster_name }}-compute-{{ partition.name }}-storage-${count.index}"
+resource "openstack_networking_port_v2" "{{ nodegroup.name }}_storage" {
+  count = {{ nodegroup.count }}
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-storage-${count.index}"
   network_id = data.openstack_networking_network_v2.cluster_storage.id
   admin_state_up = "true"

@@ -499,25 +499,25 @@ resource "openstack_compute_instance_v2" "control" {
   }
 }

-{% for partition in openhpc_slurm_partitions %}
-resource "openstack_compute_instance_v2" "{{ partition.name }}" {
-  count = {{ partition.count }}
+{% for nodegroup in openhpc_nodegroups %}
+resource "openstack_compute_instance_v2" "{{ nodegroup.name }}" {
+  count = {{ nodegroup.count }}

-  name = "{{ cluster_name }}-compute-{{ partition.name }}-${count.index}"
+  name = "{{ cluster_name }}-compute-{{ nodegroup.name }}-${count.index}"
   image_id = "{{ cluster_image }}"
-  {% if 'flavor_name' in partition %}
-  flavor_name = "{{ partition.flavor_name }}"
+  {% if 'flavor_name' in nodegroup %}
+  flavor_name = "{{ nodegroup.flavor_name }}"
   {% else %}
-  flavor_id = "{{ partition.flavor }}"
+  flavor_id = "{{ nodegroup.flavor }}"
   {% endif %}

   network {
-    port = openstack_networking_port_v2.{{ partition.name }}[count.index].id
+    port = openstack_networking_port_v2.{{ nodegroup.name }}[count.index].id
   }

   {% if cluster_storage_network is defined %}
   network {
-    port = openstack_networking_port_v2.{{ partition.name }}_storage[count.index].id
+    port = openstack_networking_port_v2.{{ nodegroup.name }}_storage[count.index].id
   }
   {% endif %}

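
The instance resource above picks a flavor by name when a nodegroup entry defines `flavor_name`, and otherwise treats `flavor` as an ID. A sketch of the two accepted shapes, using hypothetical values and only the keys this template reads (`name`, `count`, `flavor_name`/`flavor`):

```yaml
# Hypothetical values - keys are those referenced by resources.tf.j2 above
openhpc_nodegroups:
  - name: general            # port/instance names become <cluster_name>-compute-general-<index>
    count: 2                 # number of ports and instances created
    flavor_name: small       # rendered as flavor_name = "small"
  - name: gpu
    count: 1
    flavor: <flavor-uuid>    # no flavor_name key, so rendered as flavor_id = "<flavor-uuid>"
```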

ansible/roles/cuda/README.md

Lines changed: 1 addition & 1 deletion
@@ -10,6 +10,6 @@ Requires OFED to be installed to provide required kernel-* packages.

 - `cuda_repo_url`: Optional. URL of `.repo` file. Default is upstream for appropriate OS/architecture.
 - `cuda_nvidia_driver_stream`: Optional. Version of `nvidia-driver` stream to enable. This controls whether the open or proprietary drivers are installed and the major version. Changing this once the drivers are installed does not change the version.
-- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-8']`.
+- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-9']`.
 - `cuda_package_version`: Optional. Default `latest` which will install the latest packages if not installed but won't upgrade already-installed packages. Use `'none'` to skip installing CUDA.
 - `cuda_persistenced_state`: Optional. State of systemd `nvidia-persistenced` service. Values as [ansible.builtin.systemd:state](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html#parameter-state). Default `started`.

ansible/roles/cuda/defaults/main.yml

Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
 cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos/rhel{{ ansible_distribution_major_version }}/{{ ansible_architecture }}/cuda-rhel{{ ansible_distribution_major_version }}.repo"
-cuda_nvidia_driver_stream: '570-open'
-cuda_package_version: '12.8.1-1'
-cuda_version_short: '12.8'
+cuda_nvidia_driver_stream: '575-open'
+cuda_package_version: '12.9.0-1'
+cuda_version_short: '12.9'
 cuda_packages:
   - "cuda{{ ('-' + cuda_package_version) if cuda_package_version != 'latest' else '' }}"
   - nvidia-gds
   - cmake
-  - cuda-toolkit-12-8
+  - cuda-toolkit-12-9
 cuda_samples_release_url: "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v{{ cuda_version_short }}.tar.gz"
 cuda_samples_path: "/var/lib/{{ ansible_user }}/cuda_samples"
 cuda_samples_programs:
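
Note the first `cuda_packages` entry is templated against `cuda_package_version`, so the version bump above also changes which package spec gets installed. An illustration of how that entry evaluates (not part of the role defaults):

```yaml
# Illustration only: evaluation of
# "cuda{{ ('-' + cuda_package_version) if cuda_package_version != 'latest' else '' }}"
cuda_package_version: '12.9.0-1'   # pinned (the default above) -> first entry renders as "cuda-12.9.0-1"
# cuda_package_version: 'latest'   # unpinned -> first entry renders as just "cuda"
```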

ansible/roles/cuda/tasks/samples.yml

Lines changed: 0 additions & 33 deletions
@@ -25,36 +25,3 @@
     cmd: . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
     chdir: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build"
     creates: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-
-- name: Run CUDA deviceQuery
-  command:
-    cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/deviceQuery/deviceQuery"
-  register: _cuda_devicequery
-
-- name: Set fact for CUDA devices
-  set_fact:
-    cuda_devices: "{{ _cuda_devicequery.stdout | regex_findall('Device (\\d+):') }}"
-
-- name: Run CUDA bandwidth test
-  command:
-    cmd: "{{ cuda_samples_path }}/cuda-samples-{{ cuda_version_short }}/build/Samples/1_Utilities/bandwidthTest/bandwidthTest --device={{ item }}"
-  register: _cuda_bandwidthtest
-  loop: "{{ cuda_devices }}"
-  loop_control:
-    label: "Device {{ item }}" # e.g '0'
-
-- name: Summarise bandwidth test output
-  debug:
-    msg: |
-      {{ _parts[1].splitlines()[0] | trim }}
-      Bandwidths: (Gb/s)
-      Host to Device: {{ _parts[2].split()[-1] }}
-      Device to Host: {{ _parts[3].split()[-1] }}
-      Device to Device: {{ _parts[4].split()[-1] }}
-      {{ ': '.join(_parts[5].split('=') | map('trim')) }}
-      {{ _parts[6] }}
-  loop: "{{ _cuda_bandwidthtest.results }}"
-  vars:
-    _parts: "{{ item.stdout.split('\n\n') }}"
-  loop_control:
-    label: "Device {{ item.item }}" # e.g '0'

ansible/roles/openondemand/README.md

Lines changed: 2 additions & 2 deletions
@@ -59,10 +59,10 @@ This role enables SSL on the Open Ondemand server, using the following self-sign
 - `new_window`: Optional. Whether to open link in new window. Bool, default `false`.
 - `app_name`: Optional. Unique name for app appended to `/var/www/ood/apps/sys/`. Default is `name`, useful if that is not unique or not suitable as a path component.
 - `openondemand_dashboard_support_url`: Optional. URL or email etc to show as support contact under Help in dashboard. Default `(undefined)`.
-- `openondemand_desktop_partition`: Optional. Name of Slurm partition to use for remote desktops. Requires a corresponding group named "openondemand_desktop" and entry in openhpc_slurm_partitions.
+- `openondemand_desktop_partition`: Optional. Name of Slurm partition to use for remote desktops. Requires a corresponding group named "openondemand_desktop" and entry in openhpc_partitions.
 - `openondemand_desktop_screensaver`: Optional. Whether to enable screen locking/screensaver. **NB:** Users must have passwords if this is enabled. Bool, default `false`.
 - `openondemand_filesapp_paths`: List of paths (in addition to $HOME, which is always added) to include shortcuts to within the Files dashboard app.
-- `openondemand_jupyter_partition`: Required. Name of Slurm partition to use for Jupyter Notebook servers. Requires a corresponding group named "openondemand_jupyter" and entry in openhpc_slurm_partitions.
+- `openondemand_jupyter_partition`: Required. Name of Slurm partition to use for Jupyter Notebook servers. Requires a corresponding group named "openondemand_jupyter" and entry in openhpc_partitions.

 ### Monitoring
 - `openondemand_exporter`: Optional. Install the Prometheus [ondemand_exporter](https://github.com/OSC/ondemand_exporter) on the `openondemand` node to export metrics about Open Ondemand itself. Default `true`.

ansible/roles/rebuild/README.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ The below are only used by this role's `rebuild.yml` task file, i.e. when
 running the `ansible/adhoc/rebuild-via-slurm.yml` playbook:

 - `rebuild_job_partitions`: Optional. Comma-separated list of names of rebuild
-  partitions defined in `openhpc_slurm_partitions`. Useful as an extra-var for
+  partitions defined in `openhpc_partitions`. Useful as an extra-var for
   limiting rebuilds. Default `rebuild`.

 - `rebuild_job_name`: Optional. Name of rebuild jobs. Default is `rebuild-`

ansible/validate.yml

Lines changed: 5 additions & 0 deletions
@@ -23,12 +23,17 @@
   gather_facts: false
   tags: openhpc
   tasks:
+    - import_role:
+        name: stackhpc.openhpc
+        tasks_from: validate.yml
     - assert:
         that: "'enable_configless' in openhpc_config.SlurmctldParameters | default([])"
         fail_msg: |
           'enable_configless' not found in openhpc_config.SlurmctldParameters - is variable openhpc_config overridden?
           Additional slurm.conf parameters should be provided using variable openhpc_config_extra.
         success_msg: Checked Slurm will be configured for configless operation
+      delegate_to: localhost
+      run_once: true

 - name: Validate filebeat configuration
   hosts: filebeat

docs/README.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# StackHPC Slurm Appliance Documentation
+
+### Operator docs
+
+[Image build](image-build.md)
+
+[CI](ci.md)
+
+[Monitoring and logging](monitoring-and-logging.md)
+
+[Operations guide](operations.md)
+
+[Production deployment](production.md)
+
+[Upgrades](upgrades.md)
+
+[Sequence diagrams](sequence.md)
+
+### Configuration docs
+
+[Alerting](alerting.md)
+
+[Chrony](chrony.md)
+
+[Environments](environments.md)
+
+[K3s](k3s.README.md)
+
+[Networking](networks.md)
+
+[Open OnDemand](openondemand.md)
+
+[Persistent state](persistent-state.md)
+
+#### Experimental features
+
+[Compute init](experimental/compute-init.md)
+
+[Pulp](experimental/pulp.md)
+
+[Slurm controlled rebuild](experimental/slurm-controlled-rebuild.md)
+
+### Contributor docs
+
+[Adding functionality](adding-functionality.md)

docs/experimental/slurm-controlled-rebuild.md

Lines changed: 9 additions & 34 deletions
@@ -107,42 +107,17 @@ The configuration of this is complex and involves:
    defined in the `compute` or `login` variables, to override the default
    image for specific node groups.

-5. Modify `openhpc_slurm_partitions` to add a new partition covering rebuildable
-   nodes to use for for rebuild jobs. If using the default OpenTofu
-   configurations, this variable is contained in an OpenTofu-templated file
-   `environments/$ENV/group_vars/all/partitions.yml` which must be overriden
-   by copying it to e.g. a `z_partitions.yml` file in the same directory.
-   However production sites will probably be overriding this file anyway to
-   customise it.
-
-   An example partition definition, given the two node groups "general" and
-   "gpu" shown in Step 2, is:
-
-   ```yaml
-   openhpc_slurm_partitions:
-     ...
-     - name: rebuild
-       groups:
-         - name: general
-         - name: gpu
-       default: NO
-       maxtime: 30
-       partition_params:
-         PriorityJobFactor: 65533
-         Hidden: YES
-         RootOnly: YES
-         DisableRootJobs: NO
-         PreemptMode: 'OFF'
-         OverSubscribe: EXCLUSIVE
-   ```
-
-   Which has parameters as follows:
+5. Ensure `openhpc_partitions` contains a partition covering the nodes to run
+   rebuild jobs. The default definition in `environments/common/inventory/group_vars/all/openhpc.yml`
+   will automatically include this via `openhpc_rebuild_partition` also in that
+   file. If modifying this, note the important parameters are:
+
    - `name`: Partition name matching `rebuild` role variable `rebuild_partitions`,
      default `rebuild`.
-   - `groups`: A list of node group names, matching keys in the OpenTofu
-     `compute` variable (see example in step 2 above). Normally every compute
-     node group should be listed here, unless Slurm-controlled rebuild is not
-     required for certain node groups.
+   - `groups`: A list of nodegroup names, matching `openhpc_nodegroup` and
+     keys in the OpenTofu `compute` variable (see example in step 2 above).
+     Normally every compute node group should be listed here, unless
+     Slurm-controlled rebuild is not required for certain node groups.
    - `default`: Must be set to `NO` so that it is not the default partition.
    - `maxtime`: Maximum time to allow for rebuild jobs, in
      [slurm.conf format](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
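
For orientation, the partition example deleted above maps fairly directly onto the new variables. Below is a sketch of what an equivalent `openhpc_rebuild_partition` definition might look like, adapted from the removed block and the parameter descriptions retained above; it is illustrative only, and the real default lives in `environments/common/inventory/group_vars/all/openhpc.yml`.

```yaml
# Sketch only - adapted from the example removed in this diff, not copied from
# environments/common/inventory/group_vars/all/openhpc.yml; check that file for the real definition
openhpc_rebuild_partition:
  name: rebuild             # must match the rebuild role's rebuild_partitions
  groups:                   # nodegroup names, matching openhpc_nodegroups / OpenTofu compute keys
    - general
    - gpu
  default: NO
  maxtime: 30
  partition_params:
    PriorityJobFactor: 65533
    Hidden: YES
    RootOnly: YES
    DisableRootJobs: NO
    PreemptMode: 'OFF'
    OverSubscribe: EXCLUSIVE
```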

0 commit comments
