* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key (a sketch of this is given below).
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for [node configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION).
* `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
* `maxtime`: Optional. A partition-specific time limit following the format of the [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`. The default value is given by `openhpc_job_maxtime`. The value should be quoted to avoid Ansible conversions.
* `partition_params`: Optional. Mapping of additional parameters and values for [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION).

For each group (if used) or partition, any nodes in an ansible inventory group `<cluster_name>_<group_name>` will be added to the group/partition. Note that:
- Nodes may have arbitrary hostnames, but these should be lowercase to avoid a mismatch between inventory and actual hostname.
- Nodes in a group are assumed to be homogeneous in terms of processor and memory.
- An inventory group may be empty or missing, but if it is not then the play must contain at least one node from it (used to set processor information).
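For example, a minimal sketch of a GPU nodegroup using `gres_autodetect` might look as follows. The nodegroup name, the `nvml` detection mechanism and the feature string are illustrative, not role defaults:

```yaml
# Illustrative sketch only - names, the autodetect mechanism and values are
# examples rather than role defaults.
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: nvml   # device files detected automatically
    gres:
      - conf: gpu:A100:2    # 'file' is omitted because gres_autodetect is set
    features:
      - gpu                 # optional Feature string
```

As noted above, `GresTypes` would also need setting via `openhpc_config`; see the GRES configuration at the end of the multiple nodegroup example below.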
## Example Playbooks

The following creates a cluster with a single partition `compute`
containing two nodes:

```ini
# inventory/hosts:
[hpc_login]
cluster-login-0

[hpc_compute]
cluster-compute-0
cluster-compute-1

[hpc_control]
cluster-control
```

```yaml
#playbook.yml
---
- hosts: all
  become: yes
  tasks:
    - import_role:
        name: stackhpc.openhpc
      vars:
        openhpc_cluster_name: hpc
        openhpc_enable:
          control: "{{ inventory_hostname in groups['hpc_control'] }}"
          batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
          runtime: true
        openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
        openhpc_nodegroups:
          - name: compute
        openhpc_partitions:
          - name: compute
```
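The partition-level options described earlier can be added to this minimal example. The following sketch is illustrative only - the values shown are examples, not role defaults, and the extra partition parameter is an arbitrary choice:

```yaml
# Illustrative variation on the example above - values are examples only.
openhpc_partitions:
  - name: compute
    default: 'YES'
    maxtime: '1-0'                 # 1 day, slurm.conf MaxTime format
    partition_params:
      OverSubscribe: 'EXCLUSIVE'   # any additional partition configuration parameters
```

Quoting values such as `maxtime` avoids unwanted Ansible/YAML type conversions, as noted above.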
### Multiple nodegroups

This example shows how partitions can span multiple types of compute node.

This example inventory describes three types of compute node (login and
control nodes are omitted for brevity):

```ini
# inventory/hosts:
...
[hpc_general]
# standard compute nodes
cluster-general-0
cluster-general-1

[hpc_large]
# large memory nodes
cluster-largemem-0
cluster-largemem-1

[hpc_gpu]
# GPU nodes
cluster-a100-0
cluster-a100-1
...
```
First, `openhpc_nodegroups` is set to capture these inventory groups and
apply any node-level parameters - in this case the large memory nodes have
two cores reserved, and GRES is configured for the GPU nodes:

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```
Now two partitions can be configured - a default one with a short time limit and
no large memory nodes for testing jobs, and another with all hardware and a longer
job runtime for "production" jobs:

```yaml
openhpc_partitions:
  - name: test
    nodegroups:
      - general
      - gpu
    maxtime: '1:0:0'  # 1 hour
    default: 'YES'
  - name: general
    nodegroups:
      - general
      - large
      - gpu
    maxtime: '2-0'  # 2 days
    default: 'NO'
```
Users will select the partition using the `--partition` argument and request nodes
with appropriate memory or GPUs using the `--mem` and `--gres` or `--gpus*`
options for `sbatch` or `srun`.

Finally, some additional configuration for GRES must be provided via `openhpc_config`:

```yaml
openhpc_config:
  GresTypes:
    - gpu
```
<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)