[...] unique set of homogeneous nodes:

  - `conf`: A string with the [resource specification](https://slurm.schedmd.com/slurm.conf.html#OPT_Gres_1), but requiring the format `<name>:<type>:<number>`, e.g. `gpu:A100:2`. Note the `type` is an arbitrary string.
  - `file`: Omit if `gres_autodetect` is set. A string with the [File](https://slurm.schedmd.com/gres.conf.html#OPT_File) (path to device(s)) for this resource, e.g. `/dev/nvidia[0-1]` for the above example.

  Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used; a minimal sketch is given below this parameter list.

* `features`: Optional. List of [Features](https://slurm.schedmd.com/slurm.conf.html#OPT_Features) strings.
* `node_params`: Optional. Mapping of additional parameters and values for node configuration.

[...]

This is used to set `Sockets`, `CoresPerSocket`, `ThreadsPerCore` and
optionally `RealMemory` for the nodegroup.
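
As a minimal sketch of the GRES options above (a fuller worked example is given
under "Multiple nodegroups" below), a nodegroup with two NVIDIA A100 GPUs might
be described roughly as follows; the nodegroup name and device paths here are
illustrative only:

```yaml
openhpc_nodegroups:
  - name: gpu                  # illustrative nodegroup name
    gres:
      - conf: gpu:A100:2       # <name>:<type>:<number>
        file: /dev/nvidia[0-1] # device paths for the two GPUs
openhpc_config:
  GresTypes:
    - gpu                      # required whenever `gres` is used
```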

`openhpc_partitions`: Optional. List of mappings, each defining a
partition. Each partition mapping may contain:
* `name`: Required. Name of partition.
* `groups`: Optional. List of nodegroup names. If omitted, the nodegroup with
  the same name as the partition is used.
* `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
* `maxtime`: Optional. A partition-specific time limit overriding `openhpc_job_maxtime`.
* `partition_params`: Optional. Mapping of additional parameters and values for
  partition configuration.
  **NB:** Parameters which can be set via the keys above must not be included here.

If this variable is not set, one partition per nodegroup is created, with the
default partition configuration for each.
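
For illustration, a minimal partition definition using the keys above might look
roughly like the following; the nodegroup name `compute` and the `OverSubscribe`
entry are assumptions for this sketch rather than values required by the role:

```yaml
openhpc_partitions:
  - name: compute         # no `groups` key, so the `compute` nodegroup is used
    default: 'YES'
    maxtime: '3-0'        # 3 days; quoted to avoid Ansible type conversion
    partition_params:
      OverSubscribe: 'NO' # example of an arbitrary extra partition parameter
```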

`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days). See the
[slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime` for the
format. The value should be quoted to avoid Ansible conversions.

[...]

The following creates a cluster with a single partition `compute` containing
two nodes:

```ini
# inventory/hosts:
[hpc_login]
cluster-login-0

[hpc_compute]
cluster-compute-0
cluster-compute-1

[hpc_control]
cluster-control
```

```yaml
#playbook.yml
---
- hosts: all
  become: yes
  tasks:
    - import_role:
        name: stackhpc.openhpc
      vars:
        openhpc_cluster_name: hpc
        openhpc_enable:
          control: "{{ inventory_hostname in groups['hpc_control'] }}"
          batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
          runtime: true
        openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
        openhpc_nodegroups:
          - name: compute
        openhpc_partitions:
          - name: compute
```

### Multiple nodegroups

This example shows how partitions can span multiple types of compute node.

The following inventory describes three types of compute node (login and
control nodes are omitted for brevity):

```ini
# inventory/hosts:
...
[hpc_general]
# standard compute nodes
cluster-general-0
cluster-general-1

[hpc_large]
# large memory nodes
cluster-largemem-0
cluster-largemem-1

[hpc_gpu]
# GPU nodes
cluster-a100-0
cluster-a100-1
...
```

First, `openhpc_nodegroups` is set to capture these inventory groups and apply
any node-level parameters. In this case the `largemem` nodes have two cores
reserved for system use, and GRES is configured for the GPU nodes:

```yaml
openhpc_cluster_name: hpc
openhpc_nodegroups:
  - name: general
  - name: large
    node_params:
      CoreSpecCount: 2
  - name: gpu
    gres:
      - conf: gpu:A100:2
        file: /dev/nvidia[0-1]
```

Now two partitions can be configured: a default partition with a short time
limit and no large-memory nodes, for testing jobs, and another with all the
hardware and a longer time limit for "production" jobs:

```yaml
openhpc_partitions:
  - name: test
    groups:
      - general
      - gpu
    maxtime: '1:0:0' # 1 hour
    default: 'YES'
  - name: general
    groups:
      - general
      - large
      - gpu
    maxtime: '2-0' # 2 days
    default: 'NO'
```

Users select a partition with the `--partition` argument and request nodes with
the appropriate memory or GPUs using the `--mem` and `--gres` (or `--gpus*`)
options to `sbatch` or `srun`.

Finally, some additional configuration must be provided for GRES:
```yaml
openhpc_config:
  GresTypes:
    - gpu
```
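
For comparison, if Slurm's GRES autodetection is used (the `file` key is omitted
when `gres_autodetect` is set, as noted earlier), the GPU nodegroup might instead
be written roughly as follows. The `nvml` mechanism is an assumption for this
sketch and requires the NVIDIA NVML library on the GPU nodes; check the role's
`gres_autodetect` documentation for the exact behaviour:

```yaml
openhpc_nodegroups:
  - name: gpu
    gres_autodetect: nvml # assumed autodetection mechanism
    gres:
      - conf: gpu:A100:2  # no `file` needed when `gres_autodetect` is set
```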
<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)

`molecule/test14/converge.yml` (1 addition & 1 deletion):

```diff
@@ -7,7 +7,7 @@
       batch: "{{ inventory_hostname in groups['testohpc_compute'] }}"
       runtime: true
     openhpc_slurm_control_host: "{{ groups['testohpc_login'] | first }}"
-    openhpc_slurm_partitions:
+    openhpc_nodegroups:
       - name: "compute"
         extra_nodes:
           # Need to specify IPs for the non-existent State=DOWN nodes, because otherwise even in this state slurmctld will exclude a node with no lookup information from the config.
```