Commit 1ca4a4e

Add support for autodetection of gres resources
1 parent 5ceb9e1 commit 1ca4a4e

File tree

README.md
templates/gres.conf.j2

2 files changed: 61 additions & 8 deletions

README.md

Lines changed: 47 additions & 2 deletions
@@ -59,9 +59,10 @@ unique set of homogenous nodes:
   `free --mebi` total * `openhpc_ram_multiplier`.
 * `ram_multiplier`: Optional. An override for the top-level definition
   `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
-* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict must define:
+* `gres_autodetect`: Optional. The [auto-detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below), but you only need to define the `conf` key.
+* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
   - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
-  - `file`: A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
+  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
   Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
 * `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
 * `node_params`: Optional. Mapping of additional parameters and values for
@@ -277,7 +278,20 @@ openhpc_nodegroups:
       - conf: gpu:A100:2
         file: /dev/nvidia[0-1]
 ```
+or if using the NVML GRES autodetection mechanism (NOTE: this requires recompilation of the Slurm binaries to link against the [NVIDIA Management Library](#gres-autodetection)):

+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      - conf: gpu:A100:2
+```
 Now two partitions can be configured - a default one with a short timelimit and
 no large memory nodes for testing jobs, and another with all hardware and longer
 job runtime for "production" jobs:
@@ -309,4 +323,35 @@ openhpc_config:
     - gpu
 ```

+## GRES autodetection
+
+Some autodetection mechanisms require recompilation of the Slurm packages to
+link against external libraries. Examples are shown in the sections below.
+
+### Recompiling Slurm binaries against the [NVIDIA Management Library](https://developer.nvidia.com/management-library-nvml)
+
+This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
+definitions.
+
+First, [install the complete CUDA toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
+You can then recompile the Slurm packages from the source RPMs as follows:
+
+```sh
+dnf download --source slurm-slurmd-ohpc
+
+rpm -i slurm-ohpc-*.src.rpm
+
+cd /root/rpmbuild/SPECS
+
+dnf builddep slurm.spec
+
+rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
+```
+
+NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
+
+The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
+each compute node is outside the scope of this document. You can either use a custom package repository
+or simply install them manually on each node with Ansible.
+
 <b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
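
As a rough illustration of the README's closing suggestion (not part of this commit), the rebuilt RPMs could be pushed to and installed on the compute nodes with a small Ansible play like the sketch below. The `compute` group name, the controller-side `rpms/` directory and the `/tmp/slurm-rpms` staging path are assumptions made for the example.

```yaml
# Illustrative sketch only: stage and install locally rebuilt Slurm RPMs on compute nodes.
# The group name, source directory and staging path are assumptions, not part of this commit.
- hosts: compute
  become: true
  tasks:
    - name: Create a staging directory for the rebuilt RPMs
      ansible.builtin.file:
        path: /tmp/slurm-rpms
        state: directory
        mode: "0755"

    - name: Copy the rebuilt RPMs from the Ansible controller
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: /tmp/slurm-rpms/
      with_fileglob:
        - rpms/*.rpm        # wherever the built RPMs were fetched to on the controller

    - name: Find the staged RPMs on the node
      ansible.builtin.find:
        paths: /tmp/slurm-rpms
        patterns: "*.rpm"
      register: _slurm_rpms

    - name: Install the staged RPMs
      ansible.builtin.dnf:
        name: "{{ _slurm_rpms.files | map(attribute='path') | list }}"
        state: present
        disable_gpg_check: true   # locally built packages are typically unsigned
```

Serving `/root/rpmbuild/RPMS/x86_64/` from a custom dnf repository, as the README also suggests, avoids copying the files to every node.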

templates/gres.conf.j2

Lines changed: 14 additions & 6 deletions
@@ -1,11 +1,19 @@
 AutoDetect=off
 {% for nodegroup in openhpc_nodegroups %}
-{% for gres in nodegroup.gres | default([]) %}
-{% set gres_name, gres_type, _ = gres.conf.split(':') %}
-{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %}
-{% set inventory_group_hosts = groups.get(inventory_group_name, []) %}
+{% set gres_list = nodegroup.gres | default([]) %}
+{% set gres_autodetect = nodegroup.gres_autodetect | default('off') %}
+{% set inventory_group_name = openhpc_cluster_name ~ '_' ~ nodegroup.name %}
+{% set inventory_group_hosts = groups.get(inventory_group_name, []) %}
+{% if gres_autodetect | default('off') != 'off' %}
 {% for hostlist in (inventory_group_hosts | hostlist_expression) %}
-NodeName={{ hostlist }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file }}
+NodeName={{ hostlist }} AutoDetect={{ gres_autodetect }}
 {% endfor %}{# hostlists #}
-{% endfor %}{# gres #}
+{% else %}
+{% for gres in gres_list %}
+{% set gres_name, gres_type, _ = gres.conf.split(':') %}
+{% for hostlist in (inventory_group_hosts | hostlist_expression) %}
+NodeName={{ hostlist }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file | mandatory('The gres configuration dictionary: ' ~ gres ~ ' is missing the file key, but gres_autodetect is set to off. The error occurred on node group: ' ~ nodegroup.name ~ '. Please add the file key or set gres_autodetect.') }}
+{% endfor %}{# hostlists #}
+{% endfor %}{# gres #}
+{% endif %}{# autodetect #}
 {% endfor %}{# nodegroup #}
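
To make the template change concrete, this is roughly what the updated template renders for a cluster named `hpc` (hostnames are illustrative, not from the commit): a nodegroup with `gres_autodetect: nvml` gets a per-host-range `AutoDetect` override instead of explicit device paths, while a nodegroup without it keeps the previous `File=` form.

```
# Global default, as before
AutoDetect=off
# nodegroup "gpu" with gres_autodetect: nvml -- devices are discovered via NVML
NodeName=hpc-gpu-[0-1] AutoDetect=nvml
# the same gres entry without gres_autodetect would still render the explicit form:
# NodeName=hpc-gpu-[0-1] Name=gpu Type=A100 File=/dev/nvidia[0-1]
```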
