@@ -34,31 +34,165 @@ vgpu_definitions:
3434 "4g.40gb": 1
3535```
3636
37- The appliance will use the driver installed via the `` cuda `` role. Use `` lspci `` to determine the PCI
38- addresses.
37+ The appliance will use the driver installed via the `` cuda `` role.
38+
39+ Use `` lspci `` to determine the PCI addresses, e.g.:
40+
41+ ```
42+ [root@io-io-gpu-02 ~]# lspci -nn | grep -i nvidia
43+ 06:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
44+ 0c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
45+ 46:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
46+ 4c:00.0 3D controller [0302]: NVIDIA Corporation GH100 [H100 SXM5 80GB] [10de:2330] (rev a1)
47+ ```
48+
49+ The supported profiles can be discovered by consulting the [NVIDIA documentation](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-mig-profiles)
50+ or interactively by running the following on one of the compute nodes with GPU resources:
51+
52+ ```
53+ [rocky@io-io-gpu-05 ~]$ sudo nvidia-smi -i 0 -mig 1
54+ Enabled MIG Mode for GPU 00000000:06:00.0
55+ All done.
56+ [rocky@io-io-gpu-05 ~]$ sudo nvidia-smi mig -lgip
57+ +-----------------------------------------------------------------------------+
58+ | GPU instance profiles: |
59+ | GPU Name ID Instances Memory P2P SM DEC ENC |
60+ | Free/Total GiB CE JPEG OFA |
61+ |=============================================================================|
62+ | 0 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
63+ | 1 1 0 |
64+ +-----------------------------------------------------------------------------+
65+ | 0 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
66+ | 1 1 1 |
67+ +-----------------------------------------------------------------------------+
68+ | 0 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
69+ | 1 1 0 |
70+ +-----------------------------------------------------------------------------+
71+ | 0 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
72+ | 2 2 0 |
73+ +-----------------------------------------------------------------------------+
74+ | 0 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
75+ | 3 3 0 |
76+ +-----------------------------------------------------------------------------+
77+ | 0 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
78+ | 4 4 0 |
79+ +-----------------------------------------------------------------------------+
80+ | 0 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
81+ | 8 7 1 |
82+ +-----------------------------------------------------------------------------+
83+ | 1 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
84+ | 1 1 0 |
85+ +-----------------------------------------------------------------------------+
86+ | 1 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
87+ | 1 1 1 |
88+ +-----------------------------------------------------------------------------+
89+ | 1 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
90+ | 1 1 0 |
91+ +-----------------------------------------------------------------------------+
92+ | 1 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
93+ | 2 2 0 |
94+ +-----------------------------------------------------------------------------+
95+ | 1 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
96+ | 3 3 0 |
97+ +-----------------------------------------------------------------------------+
98+ | 1 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
99+ | 4 4 0 |
100+ +-----------------------------------------------------------------------------+
101+ | 1 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
102+ | 8 7 1 |
103+ +-----------------------------------------------------------------------------+
104+ | 2 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
105+ | 1 1 0 |
106+ +-----------------------------------------------------------------------------+
107+ | 2 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
108+ | 1 1 1 |
109+ +-----------------------------------------------------------------------------+
110+ | 2 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
111+ | 1 1 0 |
112+ +-----------------------------------------------------------------------------+
113+ | 2 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
114+ | 2 2 0 |
115+ +-----------------------------------------------------------------------------+
116+ | 2 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
117+ | 3 3 0 |
118+ +-----------------------------------------------------------------------------+
119+ | 2 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
120+ | 4 4 0 |
121+ +-----------------------------------------------------------------------------+
122+ | 2 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
123+ | 8 7 1 |
124+ +-----------------------------------------------------------------------------+
125+ | 3 MIG 1g.10gb 19 7/7 9.75 No 16 1 0 |
126+ | 1 1 0 |
127+ +-----------------------------------------------------------------------------+
128+ | 3 MIG 1g.10gb+me 20 1/1 9.75 No 16 1 0 |
129+ | 1 1 1 |
130+ +-----------------------------------------------------------------------------+
131+ | 3 MIG 1g.20gb 15 4/4 19.62 No 26 1 0 |
132+ | 1 1 0 |
133+ +-----------------------------------------------------------------------------+
134+ | 3 MIG 2g.20gb 14 3/3 19.62 No 32 2 0 |
135+ | 2 2 0 |
136+ +-----------------------------------------------------------------------------+
137+ | 3 MIG 3g.40gb 9 2/2 39.50 No 60 3 0 |
138+ | 3 3 0 |
139+ +-----------------------------------------------------------------------------+
140+ | 3 MIG 4g.40gb 5 1/1 39.50 No 64 4 0 |
141+ | 4 4 0 |
142+ +-----------------------------------------------------------------------------+
143+ | 3 MIG 7g.80gb 0 1/1 79.25 No 132 7 0 |
144+ | 8 7 1 |
145+ +-----------------------------------------------------------------------------+
146+ ```
39147
40148## compute_init
41149
42150Use the `` vgpu `` metadata option to enable creation of MIG devices on rebuild.
43151
44- ## gres configuration
152+ ## GRES configuration
45153
46- Enable gres autodetection. This can be set as a host or group var.
154+ Stop Terraform from templating out partitions.yml and specify ` openhpc_nodegroups ` manually.
155+
156+ GPU types can be determined by deploying Slurm without any GRES configuration and then running
157+ ` sudo slurmd -G ` on a compute node that has GPU resources. An example is shown below:
47158
48159```
49- openhpc_gres_autodetect: nvml
160+ [rocky@io-io-gpu-02 ~]$ sudo slurmd -G
161+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
162+ ,ENV_OPENCL,ENV_DEFAULT
163+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1 Links=(null) Flags=HAS_FILE,HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI
164+ ,ENV_OPENCL,ENV_DEFAULT
165+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291 Links=(null) Flags=HAS_FILE,HAS_TYPE,
166+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
167+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=417 ID=7696487 File=/dev/nvidia-caps/nvidia-cap417 Links=(null) Flags=HAS_FILE,HAS_TYPE,
168+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
169+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=336 ID=7696487 File=/dev/nvidia-caps/nvidia-cap336 Links=(null) Flags=HAS_FILE,HAS_TYPE,
170+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
171+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=345 ID=7696487 File=/dev/nvidia-caps/nvidia-cap345 Links=(null) Flags=HAS_FILE,HAS_TYPE,
172+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
173+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=354 ID=7696487 File=/dev/nvidia-caps/nvidia-cap354 Links=(null) Flags=HAS_FILE,HAS_TYPE,
174+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
175+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=507 ID=7696487 File=/dev/nvidia-caps/nvidia-cap507 Links=(null) Flags=HAS_FILE,HAS_TYPE,
176+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
177+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=516 ID=7696487 File=/dev/nvidia-caps/nvidia-cap516 Links=(null) Flags=HAS_FILE,HAS_TYPE,
178+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
179+ slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1 Index=525 ID=7696487 File=/dev/nvidia-caps/nvidia-cap525 Links=(null) Flags=HAS_FILE,HAS_TYPE,
180+ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
50181```
51182
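Counting the entries per type in the `slurmd -G` output by hand is error-prone on nodes with many MIG devices. The following is a minimal sketch, not part of the appliance, that tallies the `Type=` values and prints them in the `name:type:count` form used for `conf` entries. The function name `gres_conf_from_slurmd` is hypothetical, and the sample lines are abbreviated copies of the output above.

```python
import re
from collections import Counter


def gres_conf_from_slurmd(output: str) -> list[str]:
    """Sum the Count= values per (Name, Type) pair found in `slurmd -G` output."""
    pattern = re.compile(r"Gres Name=(\S+) Type=(\S+) Count=(\d+)")
    counts: Counter = Counter()
    for name, gres_type, count in pattern.findall(output):
        counts[(name, gres_type)] += int(count)
    # Counter keeps insertion order, so types appear in the order slurmd listed them.
    return [f"{name}:{gres_type}:{total}" for (name, gres_type), total in counts.items()]


# Abbreviated sample lines (trailing fields dropped for brevity).
sample = """\
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=0 ID=7696487 File=/dev/nvidia0
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3 Count=1 Index=1 ID=7696487 File=/dev/nvidia1
slurmd-io-io-gpu-02: Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1 Index=291 ID=7696487 File=/dev/nvidia-caps/nvidia-cap291
"""

for conf in gres_conf_from_slurmd(sample):
    print(f'- conf: "{conf}"')
```

In practice the input would be the captured output of `sudo slurmd -G` (check both stdout and stderr, since where slurmd writes these lines is not confirmed here). Treat the printed lines as a starting point and review them before copying them into the `gres` list.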
52- You should stop terraform templating out partitions.yml and specify ` openhpc_slurm_partitions ` manually.
53- An example of specifying gres resources is shown below
183+ GRES resources can then be configured manually. An example is shown below
54184(` environments/<environment>/inventory/group_vars/all/partitions-manual.yml ` ):
55185
56186```
57- openhpc_slurm_partitions:
187+ openhpc_partitions:
188+ - name: cpu
189+ - name: gpu
190+
191+ openhpc_nodegroups:
58192 - name: cpu
59193 - name: gpu
194+ gres_autodetect: nvml
60195 gres:
61- # Two cards not partitioned with MIG
62196 - conf: "gpu:nvidia_h100_80gb_hbm3:2"
63197 - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
64198 - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"