@@ -151,20 +151,10 @@ Use the ``vgpu`` metadata option to enable creation of mig devices on rebuild.

## GRES configuration

- You should stop terraform templating out partitions.yml and specify `openhpc_nodegroups` manually. To do this
- set the `autogenerated_partitions_enabled` terraform variable to `false`. For example (`environments/production/tofu/main.tf`):
-
- ```
- module "cluster" {
-   source = "../../site/tofu/"
-   ...
-   # We manually populate this to add GRES. See environments/site/inventory/group_vars/all/partitions-manual.yml.
-   autogenerated_partitions_enabled = false
- }
- ```
-
- GPU types can be determined by deploying slurm without any gres configuration and then running
- `sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:
+ GPU resources need to be added to the OpenHPC nodegroup definitions (`openhpc_nodegroups`). To
+ do this you need to determine the names of the GPU types as detected by slurm. First
+ deploy slurm with the default nodegroup definitions to get a working cluster. You will then be
+ able to run `sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:

```
[rocky@io-io-gpu-02 ~]$ sudo slurmd -G
@@ -191,7 +181,7 @@ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
```

GRES resources can then be configured manually. An example is shown below
- (`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):
+ (`environments/<environment>/inventory/group_vars/all/openhpc.yml`):

```
openhpc_partitions:
@@ -207,3 +197,5 @@ openhpc_nodegroups:
- conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
- conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
```
+
+ Make sure the types (the identifier after `gpu:`) match those collected with `slurmd -G`.
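+
+ For illustration, each `conf` string follows Slurm's `gpu:<type>:<count>` GRES format, with `<type>` taken
+ verbatim from the `slurmd -G` output. A minimal sketch of a nodegroup entry is shown below; the nodegroup name
+ `gpu` and the `name`/`gres` key layout are illustrative assumptions, so check the stackhpc.openhpc role
+ documentation for the exact schema rather than copying this verbatim:
+
+ ```
+ openhpc_nodegroups:
+   - name: gpu   # assumed nodegroup name, not taken from this repository
+     gres:
+       # <type> must match a type string reported by slurmd -G on the node
+       - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
+       - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
+ ```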