`openhpc_nodegroups`: Optional. List of mappings, each defining a unique set of homogeneous nodes:
* `gres_autodetect`: Optional. The [auto detection mechanism](https://slurm.schedmd.com/gres.conf.html#OPT_AutoDetect) to use for the generic resources. Note: you must still define the `gres` dictionary (see below) but you only need to define the `conf` key. See the example after this list.
* `gres`: Optional. List of dicts defining [generic resources](https://slurm.schedmd.com/gres.html). Each dict should define:
  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1) but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.

* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for node configuration. **NB:** Parameters which can be set via the keys above must not be included here.
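As an illustration, here is a minimal sketch of a GPU nodegroup using these keys. It assumes NVIDIA GPUs with Slurm's `nvml` auto-detection; the nodegroup name, GPU type and feature string are placeholders, not defaults of the role:

```yaml
openhpc_nodegroups:
  - name: gpu                 # hypothetical nodegroup name
    gres_autodetect: nvml     # let Slurm discover device files, so 'file' is omitted
    gres:
      - conf: gpu:A100:2      # <name>:<type>:<number>
    features:
      - gpu                   # arbitrary feature string for job constraints
```

Remember that `GresTypes` must also be set in `openhpc_config`, as shown at the end of the multiple-nodegroups example below.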
If no partitions are defined, one partition per nodegroup is created, with default partition configuration for each. Each partition definition may set the following optional keys:

* `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
* `maxtime`: Optional. A partition-specific time limit following the format of the [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`. The default value is given by `openhpc_job_maxtime`. The value should be quoted to avoid Ansible conversions.
* `partition_params`: Optional. Mapping of additional parameters and values for [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION). A sketch using these keys follows this list.
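For example, a sketch of a partition passing extra `slurm.conf` partition options through `partition_params` (the partition name and parameter choices are illustrative only):

```yaml
openhpc_partitions:
  - name: debug
    default: 'NO'
    maxtime: '1:0:0'           # quoted to avoid Ansible conversions
    partition_params:
      PriorityJobFactor: 10    # any valid slurm.conf partition parameter
      OverSubscribe: 'YES'
```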
For each group (if used) or partition, any nodes in an Ansible inventory group `<cluster_name>_<group_name>` will be added to the group/partition (see the example below). Note that:
- Nodes may have arbitrary hostnames, but these should be lowercase to avoid a mismatch between inventory and actual hostname.
- Nodes in a group are assumed to be homogeneous in terms of processor and memory.
- An inventory group may be empty or missing, but if it is not then the play must contain at least one node from it (used to set processor information).
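For instance, assuming `openhpc_cluster_name: hpc` and a group or partition named `compute`, a matching inventory fragment in YAML inventory format (hostnames are placeholders) might look like:

```yaml
# inventory/hosts.yml
all:
  children:
    hpc_compute:          # <cluster_name>_<group_name>
      hosts:
        hpc-compute-0:    # lowercase, matching the nodes' actual hostnames
        hpc-compute-1:
```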
`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days); see the format of the [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`.
To deploy, create a playbook which looks like this:

```yaml
---
- hosts:
  - cluster_login
  - cluster_control
  - cluster_batch
  become: yes
  roles:
    - role: openhpc
      openhpc_enable:
        control: "{{ inventory_hostname in groups['cluster_control'] }}"
        batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
        runtime: true
      openhpc_slurm_service_enabled: true
      openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
      openhpc_slurm_partitions:
        - name: "compute"
      openhpc_cluster_name: openhpc
      openhpc_packages: []
...
```

Alternatively, the role can be applied from a task using `import_role`, configuring nodegroups and partitions explicitly:

```yaml
# playbook.yml
---
- hosts: all
  become: yes
  tasks:
    - import_role:
        name: stackhpc.openhpc
      vars:
        openhpc_cluster_name: hpc
        openhpc_enable:
          control: "{{ inventory_hostname in groups['cluster_control'] }}"
          batch: "{{ inventory_hostname in groups['cluster_compute'] }}"
          runtime: true
        openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
        openhpc_nodegroups:
          - name: compute
        openhpc_partitions:
          - name: compute
```

### Multiple nodegroups

This example shows how partitions can span multiple types of compute node.

This example inventory describes three types of compute node (login and control nodes are omitted for brevity):
```ini
# inventory/hosts:
...
[hpc_general]
# standard compute nodes
cluster-general-0
cluster-general-1

[hpc_large]
# large memory nodes
cluster-largemem-0
cluster-largemem-1

[hpc_gpu]
# GPU nodes
cluster-a100-0
cluster-a100-1
...
```

Firstly, `openhpc_nodegroups` is set to capture these inventory groups and apply any node-level parameters - in this case the `largemem` nodes have two cores reserved for system use via `CoreSpecCount`, and GRES is configured for the GPU nodes:
```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```

Now two partitions can be configured - a default one with a short time limit and no large memory nodes for testing jobs, and another with all hardware and a longer job runtime for "production" jobs:
```yaml
openhpc_partitions:
  - name: test
    groups:
      - general
      - gpu
    maxtime: '1:0:0' # 1 hour
    default: 'YES'
  - name: general
    groups:
      - general
      - large
      - gpu
    maxtime: '2-0' # 2 days
    default: 'NO'
```

Users select the partition using the `--partition` argument and request nodes with appropriate memory or GPUs using the `--mem` and `--gres` or `--gpus*` options for `sbatch` or `srun`.
Finally, some additional configuration must be provided for GRES:

```yaml
openhpc_config:
  GresTypes:
    - gpu
```
<bid="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)