2 files changed: +28 -7 lines changed
@@ -261,15 +261,18 @@
 - name: Recompile and install slurm packages
   shell: |
     #!/bin/bash
-    dnf download --source slurm-slurmd-ohpc
+    set -eux
+    dnf download -y --source slurm-slurmd-ohpc
     rpm -i slurm-ohpc-*.src.rpm
     dnf install -y @'Development Tools'
     cd /root/rpmbuild/SPECS
     dnf builddep -y slurm.spec
-    rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-{{ cuda_facts_version_short }}/targets/x86_64-linux/"
-    dnf reinstall /root/rpmbuild/RPMS/x86_64/*.rpm
+    rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-{{ cuda_facts_version_short }}/targets/x86_64-linux/" slurm.spec
+    dnf reinstall -y /root/rpmbuild/RPMS/x86_64/*.rpm
     # Workaround path issue: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY
-    ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
+    if [[ ! -e /lib64/libnvidia-ml.so ]]; then
+      ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
+    fi
 
 - name: Run post.yml hook
   vars:
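
The symlink workaround in the hunk above only stays safe under `set -e` if it is skipped when the link is already present. A minimal sketch of that guard follows, using a hypothetical helper name `ensure_symlink` (not part of the playbook) so it can be exercised outside `/lib64`:

```shell
#!/bin/bash
set -eu

# Hypothetical helper (not in the playbook): create a symlink at $2 pointing
# to $1 only when nothing exists at $2 yet, so repeated runs do not abort
# under `set -e`.
ensure_symlink() {
  # -e is false for a dangling symlink, so -L is checked as well.
  if [ ! -e "$2" ] && [ ! -L "$2" ]; then
    ln -s "$1" "$2"
  fi
}
```

In the task above this would be called as `ensure_symlink /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so`. `ln -sf` is an alternative when overwriting an existing link is acceptable.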
@@ -26,8 +26,12 @@ For example in: `environments/<environment>/inventory/group_vars/all/vgpu`:
 vgpu_definitions:
   - pci_address: "0000:17:00.0"
     mig_devices:
-      "1g.10gb": 1
-      "2g.20gb": 3
+      "1g.10gb": 4
+      "4g.40gb": 1
+  - pci_address: "0000:81:00.0"
+    mig_devices:
+      "1g.10gb": 4
+      "4g.40gb": 1
 ```
 
 The appliance will use the driver installed via the ``cuda`` role. Use ``lspci`` to determine the PCI
@@ -39,4 +43,18 @@ Use the ``vgpu`` metadata option to enable creation of mig devices on rebuild.
 
 ## gres configuration
 
-TODO
+You should stop Terraform from templating out partitions.yml and specify `openhpc_slurm_partitions` manually.
+An example enabling the NVML autodetection mechanism is shown below
+(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):
+
+```
+openhpc_slurm_partitions:
+  - name: cpu
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      # Two cards not partitioned with MIG
+      - conf: "gpu:nvidia_h100_80gb_hbm3:2"
+      - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
+      - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
+```
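
When writing the `gres` `conf` entries by hand, it can help to tally how many MIG devices of each profile are defined across the cards. A minimal sketch follows, assuming a hypothetical helper `mig_totals` and a simple whitespace-separated `pci_address profile count` input format (neither is part of the appliance). The sample input is the two-card `vgpu_definitions` example from this change, and the tally only corresponds to a node's gres counts if both cards sit in the same node:

```shell
#!/bin/bash
set -eu

# Hypothetical helper: read "pci_address profile count" lines on stdin and
# print the total device count per MIG profile, sorted by profile name.
mig_totals() {
  awk '{ total[$2] += $3 } END { for (p in total) print p, total[p] }' | sort
}

# The two cards from the vgpu_definitions example in this change.
mig_totals <<'EOF'
0000:17:00.0 1g.10gb 4
0000:17:00.0 4g.40gb 1
0000:81:00.0 1g.10gb 4
0000:81:00.0 4g.40gb 1
EOF
```

The totals are only a cross-check on the hand-written counts. The gres type names themselves (such as `nvidia_h100_80gb_hbm3_1g.10gb`) still have to match what the NVML autodetection reports on the node.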