unique set of homogeneous nodes:
`free --mebi` total * `openhpc_ram_multiplier`.
* `ram_multiplier`: Optional. An override for the top-level definition
`openhpc_ram_multiplier`. Has no effect if `ram_mb` is set.
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key. See the [GRES autodetection](#gres-autodetection) section below.
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.
Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for
```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```
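For illustration, a `gres` entry like the one above is rendered by the role into a `gres.conf` line of roughly the following form, with `Name` and `Type` split out of `conf` (the `NodeName` hostlist shown here is hypothetical):

```
NodeName=hpc-gpu-[0-1] Name=gpu Type=A100 File=/dev/nvidia[0-1]
```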
or, if using the NVML `gres_autodetect` mechanism (NOTE: this requires recompilation of the Slurm binaries to link against the [NVIDIA Management Library](#gres-autodetection)):

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:A100:2
```
Now two partitions can be configured - a default one with a short timelimit and
no large memory nodes for testing jobs, and another with all hardware and longer
job runtime for "production" jobs:
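A rough sketch of such a layout is shown below. This is illustrative only: it assumes the role exposes an `openhpc_partitions` variable with `name`, `nodegroups` and `maxtime` keys, none of which are confirmed by this excerpt.

```yaml
# Sketch only: variable and key names are assumptions, not taken from this document.
openhpc_partitions:
  - name: test          # short timelimit, excludes the large-memory nodes
    nodegroups:
      - general
      - gpu
    maxtime: '1:0:0'    # 1 hour
  - name: production    # all hardware, longer job runtime
    nodegroups:
      - general
      - large
      - gpu
    maxtime: '2-0'      # 2 days
```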
```yaml
openhpc_config:
  GresTypes:
    - gpu
```
## GRES autodetection
Some autodetection mechanisms require recompilation of the Slurm packages to
link against external libraries. Examples are shown in the sections below.
### Recompiling Slurm binaries against the [NVIDIA Management Library](https://developer.nvidia.com/management-library-nvml)
This will allow you to use `gres_autodetect: nvml` in your `nodegroup`
definitions.
First, [install the complete CUDA toolkit from NVIDIA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/).
You can then recompile the Slurm packages from the source RPMs as follows:
```sh
dnf download --source slurm-slurmd-ohpc

rpm -i slurm-ohpc-*.src.rpm

cd /root/rpmbuild/SPECS

dnf builddep slurm.spec

rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-12.8/targets/x86_64-linux/" slurm.spec | tee /tmp/build.txt
```
NOTE: This will need to be adapted for the version of CUDA installed (12.8 is used in the example).
The RPMs will be created in `/root/rpmbuild/RPMS/x86_64/`. The method used to distribute these RPMs to
each compute node is out of scope for this document. You can either use a custom package repository
or simply install them manually on each node with Ansible.
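The Ansible route can be sketched as below. This is a minimal sketch only: the host group, paths, and RPM filename are hypothetical, and the RPM is assumed to have been rebuilt as described above.

```yaml
# Sketch: install a locally rebuilt slurmd RPM on compute nodes.
# The hosts group, paths and RPM filename are hypothetical.
- hosts: compute
  become: true
  tasks:
    - name: Copy the rebuilt RPM to the node
      ansible.builtin.copy:
        src: rpms/slurm-slurmd-ohpc.x86_64.rpm
        dest: /tmp/slurm-slurmd-ohpc.x86_64.rpm

    - name: Install the RPM from the local file
      ansible.builtin.dnf:
        name: /tmp/slurm-slurmd-ohpc.x86_64.rpm
        state: present
        disable_gpg_check: true   # locally rebuilt RPMs are unsigned
```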
#### Configuration example

A configuration snippet is shown below:
```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres_autodetect: nvml
    gres:
      - conf: gpu:A100:2
```
For additional context, refer to the GPU example in [Multiple Nodegroups](#multiple-nodegroups).
<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)