README.md: 47 additions & 2 deletions
@@ -59,9 +59,10 @@ unique set of homogenous nodes:
  `free --mebi` total * `openhpc_ram_multiplier`.
* `ram_multiplier`: Optional. An override for the top-level definition
  `openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
* `gres_autodetect`: Optional. The [auto-detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key.
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for
@@ -277,7 +278,20 @@ openhpc_nodegroups:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```
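
For reference, with the `conf` and `file` values above the role renders a `gres.conf` entry of the form `NodeName=... Name=... Type=... File=...` (see the template line quoted at the end of this page). For the `gpu` nodegroup that would give something like the following, where the hostlist `hpc-gpu-[0-1]` is an assumed placeholder, not output from the role:

```
NodeName=hpc-gpu-[0-1] Name=gpu Type=A100 File=/dev/nvidia[0-1]
```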

Alternatively, if using the NVML `gres_autodetect` mechanism (NOTE: this requires recompilation of the Slurm binaries to link against the [NVIDIA Management Library](#gres-autodetection)):

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:A100:2
```

Now two partitions can be configured - a default one with a short timelimit and
no large memory nodes for testing jobs, and another with all hardware and longer
job runtime for "production" jobs:
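
The partition definitions themselves are not changed here, so the diff omits them. A minimal sketch of what such an `openhpc_partitions` configuration could look like, assuming the role's `nodegroups`, `maxtime` and `default` partition keys (the names and values below are illustrative, not the README's actual example):

```yaml
openhpc_partitions:
  # short-timelimit default partition, without the large-memory nodes
  - name: test
    nodegroups:
      - general
      - gpu
    maxtime: '1:0:0'   # 1 hour
    default: true
  # all hardware, longer limit, for "production" jobs
  - name: general
    nodegroups:
      - general
      - large
      - gpu
    maxtime: '2-0'     # 2 days
    default: false
```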
@@ -309,4 +323,35 @@ openhpc_config:
    - gpu
```
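
The hunk above shows only the tail of the existing `openhpc_config` example; the part relevant to GRES is the [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) setting noted earlier, which as a minimal sketch (any other keys omitted) would look like:

```yaml
openhpc_config:
  GresTypes:
    - gpu
```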

## GRES autodetection

Some autodetection mechanisms require recompilation of the Slurm packages to
link against external libraries. Examples are shown in the sections below.

### Recompiling Slurm binaries against the [NVIDIA Management Library](https://developer.nvidia.com/management-library-nvml)

This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
definitions.

First, [install the complete CUDA toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
You can then recompile the Slurm packages from the source RPMs as follows:

```sh
# download the Slurm source RPM from the OpenHPC repository
dnf download --source slurm-slurmd-ohpc

# unpack the sources and spec file into /root/rpmbuild
rpm -i slurm-ohpc-*.src.rpm

cd /root/rpmbuild/SPECS

# install the build dependencies declared in the spec file
dnf builddep slurm.spec

# rebuild the binary RPMs with NVML support enabled
rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
```

NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).

The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method to distribute these RPMs to
each compute node is out of scope of this document. You can either use a custom package repository
or simply install them manually on each node with Ansible.
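
For the latter option, a minimal playbook sketch (assuming the rebuilt RPMs are available on the Ansible control host under `/root/rpmbuild/RPMS/x86_64/` and that the compute nodes are in a `compute` inventory group; this is an illustration, not part of the role):

```yaml
# Hypothetical playbook: copy the rebuilt Slurm RPMs to each compute node and install them.
# The host group, staging directory and glob pattern are illustrative assumptions.
- hosts: compute
  become: true
  tasks:
    - name: Copy rebuilt Slurm RPMs to the node
      ansible.builtin.copy:
        src: "{{ item }}"
        dest: /tmp/slurm-rpms/
      with_fileglob:
        - /root/rpmbuild/RPMS/x86_64/*.rpm

    - name: Install the copied RPMs
      ansible.builtin.shell: dnf install -y /tmp/slurm-rpms/*.rpm
```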
<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
NodeName={{ hostlist }} Name={{ gres_name }} Type={{ gres_type }} File={{ gres.file | mandatory('The gres configuration dictionary: ' ~ gres ~ ' is missing the file key, but gres_autodetect is set to off. The error occurred on node group: ' ~ nodegroup.name ~ '. Please add the file key or set gres_autodetect.') }}