
Commit b4c0178

Merge branch 'master' into feat/simpler-templating

2 parents: 04f3bbb + 55d8af4


47 files changed, +545 -306 lines

.github/workflows/ci.yml

Lines changed: 0 additions & 1 deletion
@@ -59,7 +59,6 @@ jobs:
           - test11
           - test12
           - test13
-          - test14
         exclude:
           # mariadb package provides /usr/bin/mysql on RL8 which doesn't work with geerlingguy/mysql role
           - scenario: test4

README.md

Lines changed: 222 additions & 52 deletions
@@ -59,15 +59,20 @@ unique set of homogenous nodes:
   `free --mebi` total * `openhpc_ram_multiplier`.
 * `ram_multiplier`: Optional. An override for the top-level definition
   `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
-* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict must define:
+* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key. See the [GRES autodetection](#gres-autodetection) section below.
+* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
   - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
-  - `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+
   Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
-* `params`: Optional. Mapping of additional parameters and values for
+* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
+* `node_params`: Optional. Mapping of additional parameters and values for
   [node configuration](https://slurm.schedmd.com/slurm.conf.html#lbAE).
+  **NB:** Parameters which can be set via the keys above must not be included here.
 
 Each nodegroup will contain hosts from an Ansible inventory group named
-`{{ openhpc_cluster_name }}_{{ group_name}}`. Note that:
+`{{ openhpc_cluster_name }}_{{ name }}`, where `name` is the nodegroup name.
+Note that:
 - Each host may only appear in one nodegroup.
 - Hosts in a nodegroup are assumed to be homogenous in terms of processor and memory.
 - Hosts may have arbitrary hostnames, but these should be lowercase to avoid a
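
To illustrate the nodegroup keys documented in the hunk above, together with the partition-level keys documented in the next hunk, a minimal sketch follows. It is not part of the diff; the cluster/nodegroup names, GRES, Features and parameter values are illustrative assumptions only:

```yaml
# Illustrative sketch only - names and values are assumptions, not from this commit.
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: cpu                    # hosts come from inventory group hpc_cpu
    features: ['bigmem']         # arbitrary Feature strings
  - name: gpu                    # hosts come from inventory group hpc_gpu
    gres_autodetect: nvml        # with autodetection only `conf` is needed below
    gres:
      - conf: gpu:A100:2
    node_params:
      CoreSpecCount: 2           # any further slurm.conf node parameters
openhpc_partitions:
  - name: cpu                    # `nodegroups` omitted, so uses nodegroup `cpu`
  - name: everything
    nodegroups: [cpu, gpu]
    maxtime: '1-0'               # quoted; overrides openhpc_job_maxtime
    partition_params:
      PreemptMode: 'OFF'         # any further partition parameters
openhpc_config:
  GresTypes:
    - gpu                        # required when `gres` is used
```
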
@@ -78,18 +83,23 @@ unique set of homogenous nodes:
   This is used to set `Sockets`, `CoresPerSocket`, `ThreadsPerCore` and
   optionally `RealMemory` for the nodegroup.
 
-`openhpc_partitions`: Optional, default `[]`. List of mappings, each defining a
+`openhpc_partitions`: Optional. List of mappings, each defining a
 partition. Each partition mapping may contain:
 * `name`: Required. Name of partition.
-* `groups`: Optional. List of nodegroup names. If omitted, the partition name
-  is assumed to match a nodegroup name.
+* `nodegroups`: Optional. List of node group names. If omitted, the node group
+  with the same name as the partition is used.
 * `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
-* `maxtime`: Optional. A partition-specific time limit following the format of [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`. The default value is
-  given by `openhpc_job_maxtime`. The value should be quoted to avoid Ansible conversions.
-* `params`: Optional. Mapping of additional parameters and values for
+* `maxtime`: Optional. A partition-specific time limit overriding `openhpc_job_maxtime`.
+* `partition_params`: Optional. Mapping of additional parameters and values for
   [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION).
+  **NB:** Parameters which can be set via the keys above must not be included here.
+
+If this variable is not set, one partition per nodegroup is created, with default
+partition configuration for each.
 
-`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days). See [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime` for format. The default is 60 days. The value should be quoted to avoid Ansible conversions.
+`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days), see
+[slurm.conf:MaxTime](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
+**NB:** This should be quoted to avoid Ansible conversions.
 
 `openhpc_cluster_name`: name of the cluster.
 
@@ -140,10 +150,12 @@ accounting data such as start and end times. By default no job accounting is con
 `openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on value of
 `openhpc_slurm_job_comp_type`, e.g. for `jobcomp/filetxt` represents a path on disk.
 
-### slurmdbd.conf
+### slurmdbd
 
-The following options affect `slurmdbd.conf`. Please see the slurm [documentation](https://slurm.schedmd.com/slurmdbd.conf.html) for more details.
-You will need to configure these variables if you have set `openhpc_enable.database` to `true`.
+When the slurm database daemon (`slurmdbd`) is enabled by setting
+`openhpc_enable.database` to `true` the following options must be configured.
+See documentation for [slurmdbd.conf](https://slurm.schedmd.com/slurmdbd.conf.html)
+for more details.
 
 `openhpc_slurmdbd_port`: Port for slurmdb to listen on, defaults to `6819`.
 
@@ -155,6 +167,30 @@ You will need to configure these variables if you have set `openhpc_enable.datab
 
 `openhpc_slurmdbd_mysql_username`: Username for authenticating with the database, defaults to `slurm`.
 
+Before starting `slurmdbd`, the role will check if a database upgrade is
+required due to a Slurm major version upgrade and carry it out if so.
+Slurm versions before 24.11 do not support this check and so no upgrade will
+occur. The following variables control behaviour during this upgrade:
+
+`openhpc_slurm_accounting_storage_client_package`: Optional. String giving the
+name of the database client package to install, e.g. `mariadb`. Default `mysql`.
+
+`openhpc_slurm_accounting_storage_backup_cmd`: Optional. String (possibly
+multi-line) giving a command for `ansible.builtin.shell` to run a backup of the
+Slurm database before performing the database upgrade. Default is the empty
+string, which performs no backup.
+
+`openhpc_slurm_accounting_storage_backup_host`: Optional. Inventory hostname
+defining the host to run the backup command on. Default is `openhpc_slurm_accounting_storage_host`.
+
+`openhpc_slurm_accounting_storage_backup_become`: Optional. Whether to run the
+backup command as root. Default `true`.
+
+`openhpc_slurm_accounting_storage_service`: Optional. Name of the systemd service
+for the accounting storage database, e.g. `mysql`. If this is defined, this
+service is stopped before the backup and restarted after, to allow for physical
+backups. Default is the empty string, which does not stop/restart any service.
+
 ## Facts
 
 This role creates local facts from the live Slurm configuration, which can be
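
Pulling together the accounting options above, a hedged sketch of enabling `slurmdbd` with a pre-upgrade backup follows. Only `openhpc_enable.database`, `openhpc_slurmdbd_port`, `openhpc_slurmdbd_mysql_username` and the `openhpc_slurm_accounting_storage_*` variables come from this README; the inventory group names, the `mysqldump` command and the remaining values are assumptions:

```yaml
# Sketch only - assumes a MariaDB/MySQL server reachable from the slurmdbd host
# and that a logical dump is an acceptable backup; adapt to your deployment.
openhpc_enable:
  control: "{{ inventory_hostname in groups['hpc_control'] }}"
  batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
  database: true                            # runs slurmdbd, see options above
  runtime: true
openhpc_slurmdbd_port: 6819                 # default
openhpc_slurmdbd_mysql_username: slurm      # default

# Database upgrade behaviour (the upgrade check needs Slurm 24.11 or later):
openhpc_slurm_accounting_storage_client_package: mariadb
openhpc_slurm_accounting_storage_backup_become: true
openhpc_slurm_accounting_storage_backup_cmd: |
  mysqldump --all-databases > /root/slurm_acct_db_backup.sql
# openhpc_slurm_accounting_storage_service is left unset because mysqldump is a
# logical backup, so the database service does not need stopping.
```
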
@@ -163,50 +199,184 @@ accessed (with facts gathering enabled) using `ansible_local.slurm`. As per the
 in mixed case are from config files. Note the facts are only refreshed
 when this role is run.
 
-## Example Inventory
-
-And an Ansible inventory as this:
-
-    [openhpc_login]
-    openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos
-
-    [openhpc_compute]
-    openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
-    openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos
+## Example
 
-    [cluster_login:children]
-    openhpc_login
+### Simple
 
-    [cluster_control:children]
-    openhpc_login
+The following creates a cluster with a single partition `compute`
+containing two nodes:
 
-    [cluster_batch:children]
-    openhpc_compute
+```ini
+# inventory/hosts:
+[hpc_login]
+cluster-login-0
 
-## Example Playbooks
+[hpc_compute]
+cluster-compute-0
+cluster-compute-1
 
-To deploy, create a playbook which looks like this:
-
-    ---
-    - hosts:
-      - cluster_login
-      - cluster_control
-      - cluster_batch
-      become: yes
-      roles:
-        - role: openhpc
-          openhpc_enable:
-            control: "{{ inventory_hostname in groups['cluster_control'] }}"
-            batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
-            runtime: true
-          openhpc_slurm_service_enabled: true
-          openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
-          openhpc_slurm_partitions:
-            - name: "compute"
-          openhpc_cluster_name: openhpc
-          openhpc_packages: []
-    ...
+[hpc_control]
+cluster-control
+```
 
+```yaml
+#playbook.yml
+---
+- hosts: all
+  become: yes
+  tasks:
+    - import_role:
+        name: stackhpc.openhpc
+      vars:
+        openhpc_cluster_name: hpc
+        openhpc_enable:
+          control: "{{ inventory_hostname in groups['hpc_control'] }}"
+          batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
+          runtime: true
+        openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
+        openhpc_nodegroups:
+          - name: compute
+        openhpc_partitions:
+          - name: compute
 ---
+```
+
+### Multiple nodegroups
+
+This example shows how partitions can span multiple types of compute node.
+
+This example inventory describes three types of compute node (login and
+control nodes are omitted for brevity):
+
+```ini
+# inventory/hosts:
+...
+[hpc_general]
+# standard compute nodes
+cluster-general-0
+cluster-general-1
+
+[hpc_large]
+# large memory nodes
+cluster-largemem-0
+cluster-largemem-1
+
+[hpc_gpu]
+# GPU nodes
+cluster-a100-0
+cluster-a100-1
+...
+```
+
+Firstly the `openhpc_nodegroups` is set to capture these inventory groups and
+apply any node-level parameters - in this case the `largemem` nodes have
+2x cores reserved for some reason, and GRES is configured for the GPU nodes:
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres:
+      - conf: gpu:A100:2
+        file: /dev/nvidia[0-1]
+```
+or if using the NVML `gres_autodetect` mechanism (NOTE: this requires recompilation of the slurm binaries to link against the [NVIDIA Management library](#gres-autodetection)):
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      - conf: gpu:A100:2
+```
+Now two partitions can be configured - a default one with a short timelimit and
+no large memory nodes for testing jobs, and another with all hardware and longer
+job runtime for "production" jobs:
+
+```yaml
+openhpc_partitions:
+  - name: test
+    nodegroups:
+      - general
+      - gpu
+    maxtime: '1:0:0' # 1 hour
+    default: 'YES'
+  - name: general
+    nodegroups:
+      - general
+      - large
+      - gpu
+    maxtime: '2-0' # 2 days
+    default: 'NO'
+```
+Users will select the partition using the `--partition` argument and request nodes
+with appropriate memory or GPUs using the `--mem` and `--gres` or `--gpus*`
+options for `sbatch` or `srun`.
+
+Finally, some additional configuration must be provided for GRES:
+```yaml
+openhpc_config:
+  GresTypes:
+    - gpu
+```
+
+## GRES autodetection
+
+Some autodetection mechanisms require recompilation of the slurm packages to
+link against external libraries. Examples are shown in the sections below.
+
+### Recompiling slurm binaries against the [NVIDIA Management library](https://developer.nvidia.com/management-library-nvml)
+
+This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
+definitions.
+
+First, [install the complete cuda toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+You can then recompile the slurm packages from the source RPMS as follows:
+
+```sh
+dnf download --source slurm-slurmd-ohpc
+
+rpm -i slurm-ohpc-*.src.rpm
+
+cd /root/rpmbuild/SPECS
+
+dnf builddep slurm.spec
+
+rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
+```
+
+NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
+
+The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
+each compute node is out of scope of this document. You can either use a custom package repository
+or simply install them manually on each node with Ansible.
+
+#### Configuration example
+
+A configuration snippet is shown below:
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      - conf: gpu:A100:2
+```
+For additional context, refer to the GPU example in [Multiple nodegroups](#multiple-nodegroups).
+
 
 <b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)

defaults/main.yml

Lines changed: 8 additions & 1 deletion
@@ -4,7 +4,7 @@ openhpc_slurm_service_started: "{{ openhpc_slurm_service_enabled }}"
 openhpc_slurm_service:
 openhpc_slurm_control_host: "{{ inventory_hostname }}"
 #openhpc_slurm_control_host_address:
-openhpc_partitions: []
+openhpc_partitions: "{{ openhpc_nodegroups }}"
 openhpc_nodegroups: []
 openhpc_cluster_name:
 openhpc_packages:
@@ -132,3 +132,10 @@ openhpc_module_system_install: true
 
 # Auto detection
 openhpc_ram_multiplier: 0.95
+
+# Database upgrade
+openhpc_slurm_accounting_storage_service: ''
+openhpc_slurm_accounting_storage_backup_cmd: ''
+openhpc_slurm_accounting_storage_backup_host: "{{ openhpc_slurm_accounting_storage_host }}"
+openhpc_slurm_accounting_storage_backup_become: true
+openhpc_slurm_accounting_storage_client_package: mysql
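
As a sketch of what the new `openhpc_partitions` default above means in practice (nodegroup names are illustrative): for nodegroups that only set `name`, leaving `openhpc_partitions` unset is equivalent to writing it out explicitly:

```yaml
# With the new default openhpc_partitions: "{{ openhpc_nodegroups }}", this...
openhpc_nodegroups:
  - name: general
  - name: gpu

# ...produces the same partitions as explicitly setting:
openhpc_partitions:
  - name: general
  - name: gpu
```
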

handlers/main.yml

Lines changed: 0 additions & 6 deletions
@@ -1,10 +1,4 @@
 ---
-# NOTE: We need this running before slurmdbd
-- name: Restart Munge service
-  service:
-    name: "munge"
-    state: restarted
-  when: openhpc_slurm_service_started | bool
 
 # NOTE: we need this running before slurmctld start
 - name: Issue slurmdbd restart command

molecule/README.md

Lines changed: 2 additions & 2 deletions
@@ -10,7 +10,7 @@ test1 | 1 | N | 2x compute node, sequential na
 test1b | 1 | N | 1x compute node
 test1c | 1 | N | 2x compute nodes, nonsequential names
 test2 | 2 | N | 4x compute node, sequential names
-test3 | 1 | Y | -
+test3 | 1 | Y | 4x compute nodes in 2x groups, single partition
 test4 | 1 | N | 2x compute node, accounting enabled
 test5 | 1 | N | As for #1 but configless
 test6 | 1 | N | 0x compute nodes, configless
@@ -21,7 +21,7 @@ test10 | 1 | N | As for #5 but then tries to ad
 test11 | 1 | N | As for #5 but then deletes a node (actually changes the partition due to molecule/ansible limitations)
 test12 | 1 | N | As for #5 but enabling job completion and testing `sacct -c`
 test13 | 1 | N | As for #5 but tests `openhpc_config` variable.
-test14 | 1 | N | As for #5 but also tests `extra_nodes` via State=DOWN nodes.
+test14 | 1 | N | [removed, extra_nodes removed]
 test15 | 1 | Y | As for #5 but also tests `partitions with different name but with the same NodeName`.
 
 
molecule/test1/converge.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@
77
batch: "{{ inventory_hostname in groups['testohpc_compute'] }}"
88
runtime: true
99
openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}"
10-
openhpc_slurm_partitions:
10+
openhpc_nodegroups:
1111
- name: "compute"
1212
openhpc_cluster_name: testohpc
1313
tasks:
