Commit 0f04ca2

Improve docs and fix script
1 parent d8968d8 commit 0f04ca2

2 files changed: 28 additions, 7 deletions

ansible/fatimage.yml

Lines changed: 7 additions & 4 deletions
@@ -261,15 +261,18 @@
     - name: Recompile and install slurm packages
       shell: |
         #!/bin/bash
-        dnf download --source slurm-slurmd-ohpc
+        set -eux
+        dnf download -y --source slurm-slurmd-ohpc
         rpm -i slurm-ohpc-*.src.rpm
         dnf install -y @'Development Tools'
         cd /root/rpmbuild/SPECS
         dnf builddep -y slurm.spec
-        rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-{{ cuda_facts_version_short }}/targets/x86_64-linux/"
-        dnf reinstall /root/rpmbuild/RPMS/x86_64/*.rpm
+        rpmbuild -bb -D "_with_nvml --with-nvml=/usr/local/cuda-{{ cuda_facts_version_short }}/targets/x86_64-linux/" slurm.spec
+        dnf reinstall -y /root/rpmbuild/RPMS/x86_64/*.rpm
         # Workaround path issue: https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY
-        ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
+        if [[ -e /lib64/libnvidia-ml.so ]]; then
+          ln -s /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
+        fi
 
     - name: Run post.yml hook
       vars:
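
The symlink workaround in this hunk is guarded by `[[ -e /lib64/libnvidia-ml.so ]]`, which is true when the path already exists; with `set -eux` in effect, `ln -s` onto an existing path would then abort the script. A minimal sketch of an idempotent variant follows. It assumes the intent is to create the link only when it is missing, so the test is negated; that reading is an assumption, not something the commit states. The helper name `ensure_symlink` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): create LINK -> TARGET
# only when nothing exists at LINK yet.
ensure_symlink() {
    local target="$1" link="$2"
    # -e follows symlinks, so also test -L to catch a dangling link.
    if [[ ! -e "$link" && ! -L "$link" ]]; then
        ln -s "$target" "$link"
    fi
}

# Usage with the paths from the workaround (needs root):
# ensure_symlink /lib64/libnvidia-ml.so.1 /lib64/libnvidia-ml.so
```

Calling the helper twice is safe: the second call finds the link and does nothing, so a rebuild on an image that already has the link does not fail.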

docs/mig.md

Lines changed: 21 additions & 3 deletions
@@ -26,8 +26,12 @@ For example in: `environments/<environment>/inventory/group_vars/all/vgpu`:
 vgpu_definitions:
   - pci_address: "0000:17:00.0"
     mig_devices:
-      "1g.10gb": 1
-      "2g.20gb": 3
+      "1g.10gb": 4
+      "4g.40gb": 1
+  - pci_address: "0000:81:00.0"
+    mig_devices:
+      "1g.10gb": 4
+      "4g.40gb": 1
 ```
 
 The appliance will use the driver installed via the ``cuda`` role. Use ``lspci`` to determine the PCI
@@ -39,4 +43,18 @@ Use the ``vgpu`` metadata option to enable creation of mig devices on rebuild.
 
 ## gres configuration
 
-TODO
+You should stop terraform templating out partitions.yml and specify `openhpc_slurm_partitions` manually.
+An example enabling the nvml autodetection mechanism is shown below
+(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):
+
+```
+openhpc_slurm_partitions:
+  - name: cpu
+  - name: gpu
+    gres_autodetect: nvml
+    gres:
+      # Two cards not partitioned with MIG
+      - conf: "gpu:nvidia_h100_80gb_hbm3:2"
+      - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
+      - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
+```
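
The `conf` strings in the example follow a visible pattern: `gpu:<type>:<count>` for whole cards and `gpu:<type>_<mig profile>:<count>` for MIG devices. A minimal sketch of assembling such strings is given below. The helper name `gres_conf` and its argument order are hypothetical, and the pattern is inferred only from the three entries shown; check the names against what nvml autodetection reports on the node.

```shell
#!/bin/bash
# Hypothetical helper (not part of the appliance): build a gres "conf" string.
# Usage: gres_conf <gpu type> <count> [<mig profile>]
gres_conf() {
    local type="$1" count="$2" profile="${3:-}"
    if [[ -n "$profile" ]]; then
        # MIG device: the profile is appended to the type with an underscore.
        printf 'gpu:%s_%s:%s\n' "$type" "$profile" "$count"
    else
        # Whole card, not partitioned with MIG.
        printf 'gpu:%s:%s\n' "$type" "$count"
    fi
}

# Reproduces the three entries from the example:
gres_conf nvidia_h100_80gb_hbm3 2
gres_conf nvidia_h100_80gb_hbm3 2 4g.40gb
gres_conf nvidia_h100_80gb_hbm3 6 1g.10gb
```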
