@@ -34,31 +34,165 @@ vgpu_definitions:
      "4g.40gb": 1
```

- The appliance will use the driver installed via the `cuda` role. Use `lspci` to determine the PCI
- addresses.
+ The appliance will use the driver installed via the `cuda` role.
+
+ Use `lspci` to determine the PCI addresses, e.g.:
+
+ ```
+ [root@io-io-gpu-02 ~]# lspci -nn | grep -i nvidia
+ 06:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
+ 0c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
+ 46:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
+ 4c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
+ ```
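+
+ For illustration, the addresses above can then be referenced from `vgpu_definitions`. The sketch
+ below assumes the `pci_address`/`mig_devices` key layout from the example at the top of this
+ section, and assumes the full `0000:` PCI domain prefix; the counts shown are just one valid
+ split (one 4g.40gb instance plus three 1g.10gb instances uses all seven slices of an H100 80GB):
+
+ ```yaml
+ # Sketch only: partition the last two cards from the lspci output above,
+ # leaving the first two cards whole. Key names are assumed, not prescriptive.
+ vgpu_definitions:
+   - pci_address: "0000:46:00.0"
+     mig_devices:
+       "4g.40gb": 1
+       "1g.10gb": 3
+   - pci_address: "0000:4c:00.0"
+     mig_devices:
+       "4g.40gb": 1
+       "1g.10gb": 3
+ ```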
+
+ The supported profiles can be discovered by consulting the [NVIDIA documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-mig-profiles)
+ or interactively by running the following on one of the compute nodes with GPU resources:
+
+ ```
+ [rocky@io-io-gpu-05 ~]$ sudo nvidia-smi -i 0 -mig 1
+ Enabled MIG Mode for GPU 00000000:06:00.0
+ All done.
+ [rocky@io-io-gpu-05 ~]$ sudo nvidia-smi mig -lgip
+ +-----------------------------------------------------------------------------+
+ | GPU instance profiles:                                                      |
+ | GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
+ |                              Free/Total   GiB              CE    JPEG  OFA  |
+ |=============================================================================|
+ |   0  MIG 1g.10gb       19     7/7        9.75       No     16     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 1g.10gb+me    20     1/1        9.75       No     16     1     0   |
+ |                                                             1     1     1   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 1g.20gb       15     4/4        19.62      No     26     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 2g.20gb       14     3/3        19.62      No     32     2     0   |
+ |                                                             2     2     0   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 3g.40gb        9     2/2        39.50      No     60     3     0   |
+ |                                                             3     3     0   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 4g.40gb        5     1/1        39.50      No     64     4     0   |
+ |                                                             4     4     0   |
+ +-----------------------------------------------------------------------------+
+ |   0  MIG 7g.80gb        0     1/1        79.25      No    132     7     0   |
+ |                                                             8     7     1   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 1g.10gb       19     7/7        9.75       No     16     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 1g.10gb+me    20     1/1        9.75       No     16     1     0   |
+ |                                                             1     1     1   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 1g.20gb       15     4/4        19.62      No     26     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 2g.20gb       14     3/3        19.62      No     32     2     0   |
+ |                                                             2     2     0   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 3g.40gb        9     2/2        39.50      No     60     3     0   |
+ |                                                             3     3     0   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 4g.40gb        5     1/1        39.50      No     64     4     0   |
+ |                                                             4     4     0   |
+ +-----------------------------------------------------------------------------+
+ |   1  MIG 7g.80gb        0     1/1        79.25      No    132     7     0   |
+ |                                                             8     7     1   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 1g.10gb       19     7/7        9.75       No     16     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 1g.10gb+me    20     1/1        9.75       No     16     1     0   |
+ |                                                             1     1     1   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 1g.20gb       15     4/4        19.62      No     26     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 2g.20gb       14     3/3        19.62      No     32     2     0   |
+ |                                                             2     2     0   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 3g.40gb        9     2/2        39.50      No     60     3     0   |
+ |                                                             3     3     0   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 4g.40gb        5     1/1        39.50      No     64     4     0   |
+ |                                                             4     4     0   |
+ +-----------------------------------------------------------------------------+
+ |   2  MIG 7g.80gb        0     1/1        79.25      No    132     7     0   |
+ |                                                             8     7     1   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 1g.10gb       19     7/7        9.75       No     16     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 1g.10gb+me    20     1/1        9.75       No     16     1     0   |
+ |                                                             1     1     1   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 1g.20gb       15     4/4        19.62      No     26     1     0   |
+ |                                                             1     1     0   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 2g.20gb       14     3/3        19.62      No     32     2     0   |
+ |                                                             2     2     0   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 3g.40gb        9     2/2        39.50      No     60     3     0   |
+ |                                                             3     3     0   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 4g.40gb        5     1/1        39.50      No     64     4     0   |
+ |                                                             4     4     0   |
+ +-----------------------------------------------------------------------------+
+ |   3  MIG 7g.80gb        0     1/1        79.25      No    132     7     0   |
+ |                                                             8     7     1   |
+ +-----------------------------------------------------------------------------+
+ ```
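+
+ Note that `-mig 1` was only needed here to list the profiles; if you do not want to leave MIG
+ mode enabled on that GPU after this interactive check, it can be switched back off (a GPU reset
+ or reboot may be required for the change to take effect), e.g.:
+
+ ```
+ [rocky@io-io-gpu-05 ~]$ sudo nvidia-smi -i 0 -mig 0
+ ```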

## compute_init

Use the `vgpu` metadata option to enable creation of MIG devices on rebuild.
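+
+ A hypothetical example of setting this when rebuilding a node via the OpenStack CLI is shown
+ below; the `compute_init_enable` metadata key and its comma-separated value format are
+ assumptions here, so check the `compute_init` role documentation for the exact convention:
+
+ ```
+ # Assumed metadata key/value format; `vgpu` is the option named above
+ openstack server set --property compute_init_enable='compute,vgpu' io-io-gpu-02
+ ```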

- ## gres configuration
+ ## GRES configuration

- Enable gres autodetection. This can be set as a host or group var.
+ You should stop Terraform from templating out `partitions.yml` and instead specify `openhpc_nodegroups` manually.
+
+ GPU types can be determined by deploying Slurm without any GRES configuration and then running
+ `sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:

```
- openhpc_gres_autodetect: nvml
+ [rocky@io-io-gpu-02 ~]$ sudo slurmd -G
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417 ID=7696487 File=/dev/nvidia-caps/nvidia-cap417 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336 ID=7696487 File=/dev/nvidia-caps/nvidia-cap336 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345 ID=7696487 File=/dev/nvidia-caps/nvidia-cap345 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354 ID=7696487 File=/dev/nvidia-caps/nvidia-cap354 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507 ID=7696487 File=/dev/nvidia-caps/nvidia-cap507 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516 ID=7696487 File=/dev/nvidia-caps/nvidia-cap516 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525 ID=7696487 File=/dev/nvidia-caps/nvidia-cap525 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
```

- You should stop terraform templating out partitions.yml and specify `openhpc_slurm_partitions` manually.
- An example of specifying gres resources is shown below
+ GRES resources can then be configured manually. An example is shown below
(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):

```
- openhpc_slurm_partitions:
+ openhpc_partitions:
+   - name: cpu
+   - name: gpu
+
+ openhpc_nodegroups:
  - name: cpu
  - name: gpu
+     gres_autodetect: nvml
    gres:
-       # Two cards not partitioned with MIG
      - conf: "gpu:nvidia_h100_80gb_hbm3:2"
      - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
      - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"