Commit 049b557

Update docs
1 parent abf35e5 commit 049b557

1 file changed

+143
-9
lines changed

docs/mig.md

Lines changed: 143 additions & 9 deletions
@@ -34,31 +34,165 @@ vgpu_definitions:
3434
"4g.40gb": 1
3535
```
3636

37-
The appliance will use the driver installed via the ``cuda`` role. Use ``lspci`` to determine the PCI
38-
addresses.
37+
The appliance will use the driver installed via the ``cuda`` role.
38+
39+
Use ``lspci`` to determine the PCI addresses, e.g.:
40+
41+
```
42+
[root@io-io-gpu-02 ~]# lspci -nn | grep -i nvidia
43+
06:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
44+
0c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
45+
46:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
46+
4c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
47+
```
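
If the PCI addresses need to be collected on several nodes, the first column of the `lspci` listing can be extracted with a small script. The following is a minimal sketch, not part of the appliance: the function name `nvidia_pci_addresses` is hypothetical, and the sample text is copied from the listing above.

```python
import re

# Sample `lspci -nn | grep -i nvidia` output, copied from the listing above.
LSPCI_OUTPUT = """\
06:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
0c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
46:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
4c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
"""

# lspci prints addresses as bus:device.function in hex, optionally
# prefixed with a four-digit PCI domain (e.g. 0000:06:00.0).
PCI_ADDRESS = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE
)


def nvidia_pci_addresses(lspci_output: str) -> list[str]:
    """Return the PCI address of every line that mentions NVIDIA."""
    addresses = []
    for line in lspci_output.splitlines():
        match = PCI_ADDRESS.match(line)
        if match and "nvidia" in line.lower():
            addresses.append(match.group(1))
    return addresses


if __name__ == "__main__":
    for address in nvidia_pci_addresses(LSPCI_OUTPUT):
        print(address)
```

The addresses are returned exactly as `lspci` printed them; whether a domain prefix is required in the configuration is not checked here.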
48+
49+
The supported profiles can be discovered by consulting the [NVIDIA documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-mig-profiles)
50+
or interactively by running the following on one of the compute nodes with GPU resources:
51+
52+
```
53+
[rocky@io-io-gpu-05 ~]$ sudo nvidia-smi -i 0 -mig 1
54+
Enabled MIG Mode for GPU 00000000:06:00.0
55+
All done.
56+
[rocky@io-io-gpu-05 ~]$ sudo nvidia-smi mig -lgip
57+
+-----------------------------------------------------------------------------+
58+
| GPU instance profiles: |
59+
| GPU Name ID Instances Memory P2P SM DEC ENC |
60+
| Free/Total GiB CE JPEG OFA |
61+
|=============================================================================|
62+
| 0 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
63+
| 1 1 0 |
64+
+-----------------------------------------------------------------------------+
65+
| 0 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
66+
| 1 1 1 |
67+
+-----------------------------------------------------------------------------+
68+
| 0 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
69+
| 1 1 0 |
70+
+-----------------------------------------------------------------------------+
71+
| 0 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
72+
| 2 2 0 |
73+
+-----------------------------------------------------------------------------+
74+
| 0 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
75+
| 3 3 0 |
76+
+-----------------------------------------------------------------------------+
77+
| 0 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
78+
| 4 4 0 |
79+
+-----------------------------------------------------------------------------+
80+
| 0 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
81+
| 8 7 1 |
82+
+-----------------------------------------------------------------------------+
83+
| 1 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
84+
| 1 1 0 |
85+
+-----------------------------------------------------------------------------+
86+
| 1 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
87+
| 1 1 1 |
88+
+-----------------------------------------------------------------------------+
89+
| 1 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
90+
| 1 1 0 |
91+
+-----------------------------------------------------------------------------+
92+
| 1 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
93+
| 2 2 0 |
94+
+-----------------------------------------------------------------------------+
95+
| 1 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
96+
| 3 3 0 |
97+
+-----------------------------------------------------------------------------+
98+
| 1 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
99+
| 4 4 0 |
100+
+-----------------------------------------------------------------------------+
101+
| 1 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
102+
| 8 7 1 |
103+
+-----------------------------------------------------------------------------+
104+
| 2 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
105+
| 1 1 0 |
106+
+-----------------------------------------------------------------------------+
107+
| 2 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
108+
| 1 1 1 |
109+
+-----------------------------------------------------------------------------+
110+
| 2 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
111+
| 1 1 0 |
112+
+-----------------------------------------------------------------------------+
113+
| 2 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
114+
| 2 2 0 |
115+
+-----------------------------------------------------------------------------+
116+
| 2 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
117+
| 3 3 0 |
118+
+-----------------------------------------------------------------------------+
119+
| 2 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
120+
| 4 4 0 |
121+
+-----------------------------------------------------------------------------+
122+
| 2 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
123+
| 8 7 1 |
124+
+-----------------------------------------------------------------------------+
125+
| 3 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
126+
| 1 1 0 |
127+
+-----------------------------------------------------------------------------+
128+
| 3 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
129+
| 1 1 1 |
130+
+-----------------------------------------------------------------------------+
131+
| 3 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
132+
| 1 1 0 |
133+
+-----------------------------------------------------------------------------+
134+
| 3 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
135+
| 2 2 0 |
136+
+-----------------------------------------------------------------------------+
137+
| 3 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
138+
| 3 3 0 |
139+
+-----------------------------------------------------------------------------+
140+
| 3 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
141+
| 4 4 0 |
142+
+-----------------------------------------------------------------------------+
143+
| 3 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
144+
| 8 7 1 |
145+
+-----------------------------------------------------------------------------+
146+
```
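
The profile table above can be long on multi-GPU nodes. As a rough sketch, it can be reduced to the profile names and instance counts per GPU. The helper name `parse_lgip` is hypothetical, and the regular expression assumes the row layout shown above (GPU index, `MIG`, profile name, profile ID, free/total instances, memory in GiB); other driver versions may print a different layout.

```python
import re
from collections import defaultdict

# A few rows in the layout printed by `nvidia-smi mig -lgip` above.
LGIP_OUTPUT = """\
| 0 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
| 1 1 0 |
| 0 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
| 1 1 1 |
| 0 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
| 4 4 0 |
| 1 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
| 8 7 1 |
"""

# GPU index, "MIG", profile name, profile ID, free/total instances, memory.
PROFILE_ROW = re.compile(
    r"\|\s*(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)\s+([\d.]+)"
)


def parse_lgip(output: str) -> dict[int, list[dict]]:
    """Group the MIG profiles listed in the output by GPU index."""
    profiles = defaultdict(list)
    for line in output.splitlines():
        match = PROFILE_ROW.match(line)
        if not match:
            continue  # border, header, or the second line of a row
        gpu, name, profile_id, _free, total, memory = match.groups()
        profiles[int(gpu)].append(
            {
                "name": name,
                "id": int(profile_id),
                "max_instances": int(total),
                "memory_gib": float(memory),
            }
        )
    return dict(profiles)


if __name__ == "__main__":
    for gpu, entries in parse_lgip(LGIP_OUTPUT).items():
        print(gpu, [(e["name"], e["max_instances"]) for e in entries])
```

The profile names recovered this way (for example `4g.40gb`) are the strings used as keys in `vgpu_definitions`.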
39147

40148
## compute_init
41149

42150
Use the ``vgpu`` metadata option to enable creation of MIG devices on rebuild.
43151

44-
## gres configuration
152+
## GRES configuration
45153

46-
Enable gres autodetection. This can be set as a host or group var.
154+
You should stop Terraform templating out `partitions.yml` and specify `openhpc_nodegroups` manually.
155+
156+
GPU types can be determined by deploying Slurm without any GRES configuration and then running
157+
`sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:
47158

48159
```
49-
openhpc_gres_autodetect: nvml
160+
[rocky@io-io-gpu-02 ~]$ sudo slurmd -G
161+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
162+
,ENV_OPENCL,ENV_DEFAULT
163+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
164+
,ENV_OPENCL,ENV_DEFAULT
165+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291 Links=(null) Flags=HAS_FILE,HAS_TYPE,
166+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
167+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417 ID=7696487 File=/dev/nvidia-caps/nvidia-cap417 Links=(null) Flags=HAS_FILE,HAS_TYPE,
168+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
169+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336 ID=7696487 File=/dev/nvidia-caps/nvidia-cap336 Links=(null) Flags=HAS_FILE,HAS_TYPE,
170+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
171+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345 ID=7696487 File=/dev/nvidia-caps/nvidia-cap345 Links=(null) Flags=HAS_FILE,HAS_TYPE,
172+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
173+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354 ID=7696487 File=/dev/nvidia-caps/nvidia-cap354 Links=(null) Flags=HAS_FILE,HAS_TYPE,
174+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
175+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507 ID=7696487 File=/dev/nvidia-caps/nvidia-cap507 Links=(null) Flags=HAS_FILE,HAS_TYPE,
176+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
177+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516 ID=7696487 File=/dev/nvidia-caps/nvidia-cap516 Links=(null) Flags=HAS_FILE,HAS_TYPE,
178+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
179+
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525 ID=7696487 File=/dev/nvidia-caps/nvidia-cap525 Links=(null) Flags=HAS_FILE,HAS_TYPE,
180+
ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
50181
```
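
Counting the `Type=` values by hand is error-prone when a node exposes many MIG devices. The sketch below tallies the `slurmd -G` output into `name:type:count` strings, the form used by the `conf` entries in this document. The function name `gres_conf_entries` is hypothetical, and the sample lines are abbreviated to the fields the script reads.

```python
import re
from collections import Counter

# `slurmd -G` lines from the example above, trimmed after the Index field.
SLURMD_OUTPUT = """\
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525
"""

GRES_LINE = re.compile(r"Gres Name=(\S+) Type=(\S+) Count=(\d+)")


def gres_conf_entries(slurmd_output: str) -> list[str]:
    """Sum the Count of each (name, type) pair, in order of first appearance."""
    counts = Counter()
    for match in GRES_LINE.finditer(slurmd_output):
        name, gres_type, count = match.groups()
        counts[(name, gres_type)] += int(count)
    return [f"{name}:{gres_type}:{total}" for (name, gres_type), total in counts.items()]


if __name__ == "__main__":
    for entry in gres_conf_entries(SLURMD_OUTPUT):
        print(f'- conf: "{entry}"')
```

For the sample above this yields two whole cards, two `4g.40gb` instances and six `1g.10gb` instances.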
51182

52-
You should stop terraform templating out partitions.yml and specify `openhpc_slurm_partitions` manually.
53-
An example of specifying gres resources is shown below
183+
GRES resources can then be configured manually. An example is shown below
54184
(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):
55185

56186
```
57-
openhpc_slurm_partitions:
187+
openhpc_partitions:
188+
- name: cpu
189+
- name: gpu
190+
191+
openhpc_nodegroups:
58192
- name: cpu
59193
- name: gpu
194+
gres_autodetect: nvml
60195
gres:
61-
# Two cards not partitioned with MIG
62196
- conf: "gpu:nvidia_h100_80gb_hbm3:2"
63197
- conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
64198
- conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
