Commit 54ed6b6

Update README and bump versions in requirements.yml
1 parent dc2455f commit 54ed6b6

2 files changed: +9 -18 lines changed

docs/mig.md

Lines changed: 7 additions & 15 deletions
@@ -151,20 +151,10 @@ Use the ``vgpu`` metadata option to enable creation of mig devices on rebuild.
 
 ## GRES configuration
 
-You should stop terraform templating out partitions.yml and specify `openhpc_nodegroups` manually. To do this
-set the `autogenerated_partitions_enabled` terraform variable to `false`. For example (`environments/production/tofu/main.tf`):
-
-```
-module "cluster" {
-source = "../../site/tofu/"
-...
-# We manually populate this to add GRES. See environments/site/inventory/group_vars/all/partitions-manual.yml.
-autogenerated_partitions_enabled = false
-}
-```
-
-GPU types can be determined by deploying slurm without any gres configuration and then running
-`sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:
+GPU resources need to be added to the OpenHPC nodegroup definitions (`openhpc_nodegroups`). To
+do this you need to determine the names of the GPU types as detected by slurm. First
+deploy slurm with the default nodegroup definitions to get a working cluster. you will then be
+able to run: `sudo slurmd -G` on a compute node where GPU resources exist. An example is shown below:
 
 ```
 [rocky@io-io-gpu-02 ~]$ sudo slurmd -G
@@ -191,7 +181,7 @@ ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT
 ```
 
 GRES resources can then be configured manually. An example is shown below
-(`environments/<environment>/inventory/group_vars/all/partitions-manual.yml`):
+(`environments/<environment>/inventory/group_vars/all/openhpc.yml`):
 
 ```
 openhpc_partitions:
@@ -207,3 +197,5 @@ openhpc_nodegroups:
 - conf: "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2"
 - conf: "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6"
 ```
+
+Making sure the types (the identifier after `gpu:`) match those collected with `slurmd -G`.
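
The check described in the last added line (the type after `gpu:` must match what `slurmd -G` reports) could be automated along the following lines. This is a minimal sketch, not part of the commit: it assumes each GRES line printed by `slurmd -G` contains `Name=gpu Type=<type>`, and the `SAMPLE` text, `detected_types`, and `unmatched_confs` are hypothetical names introduced only for illustration.

```python
import re

# Assumption: each GRES line printed by `slurmd -G` contains "Name=gpu Type=<type>".
# Adjust the pattern if the output on your cluster is shaped differently.
GPU_TYPE = re.compile(r"Name=gpu\s+Type=(\S+)")

# Illustrative input only; this is not captured `slurmd -G` output.
SAMPLE = """\
Gres Name=gpu Type=nvidia_h100_80gb_hbm3_4g.40gb Count=1
Gres Name=gpu Type=nvidia_h100_80gb_hbm3_1g.10gb Count=1
"""


def detected_types(slurmd_output: str) -> set[str]:
    """Collect every GPU type name found in the given text."""
    return set(GPU_TYPE.findall(slurmd_output))


def unmatched_confs(confs: list[str], types: set[str]) -> list[str]:
    """Return the "gpu:<type>:<count>" strings whose type is not in `types`."""
    bad = []
    for conf in confs:
        parts = conf.split(":")
        # A well-formed entry has exactly three fields: name, type, count.
        if len(parts) != 3 or parts[0] != "gpu" or parts[1] not in types:
            bad.append(conf)
    return bad


if __name__ == "__main__":
    confs = [
        "gpu:nvidia_h100_80gb_hbm3_4g.40gb:2",
        "gpu:nvidia_h100_80gb_hbm3_1g.10gb:6",
    ]
    print(unmatched_confs(confs, detected_types(SAMPLE)))
```

In practice the text passed to `detected_types` would be the saved output of `sudo slurmd -G` from a GPU node, and `confs` would be the `conf` values from `openhpc_nodegroups`. An empty list means every configured type was detected.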

requirements.yml

Lines changed: 2 additions & 3 deletions
@@ -55,7 +55,6 @@ collections:
 version: 0.0.15
 - name: stackhpc.pulp
 version: 0.5.5
-- name: https://github.com/stackhpc/ansible-collection-linux
-type: git
-version: feature/mig-only
+- name: stackhpc.linux
+version: 1.4.0
 ...
