
Commit f126bba

add better examples in README
1 parent 4f3bbc8 commit f126bba

1 file changed

README.md

Lines changed: 114 additions & 43 deletions
@@ -161,50 +161,121 @@ accessed (with facts gathering enabled) using `ansible_local.slurm`. As per the
 in mixed case are from config files. Note the facts are only refreshed
 when this role is run.
 
-## Example Inventory
-
-And an Ansible inventory as this:
-
-    [openhpc_login]
-    openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos
-
-    [openhpc_compute]
-    openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
-    openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos
-
-    [cluster_login:children]
-    openhpc_login
-
-    [cluster_control:children]
-    openhpc_login
-
-    [cluster_batch:children]
-    openhpc_compute
-
-## Example Playbooks
-
-To deploy, create a playbook which looks like this:
-
-    ---
-    - hosts:
-      - cluster_login
-      - cluster_control
-      - cluster_batch
-      become: yes
-      roles:
-        - role: openhpc
-          openhpc_enable:
-            control: "{{ inventory_hostname in groups['cluster_control'] }}"
-            batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
-            runtime: true
-          openhpc_slurm_service_enabled: true
-          openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
-          openhpc_slurm_partitions:
-            - name: "compute"
-          openhpc_cluster_name: openhpc
-          openhpc_packages: []
-    ...
+## Example
 
+### Simple
+
+The following creates a cluster with a single partition `compute`
+containing two nodes:
+
+```ini
+# inventory/hosts:
+[hpc_login]
+cluster-login-0
+
+[hpc_compute]
+cluster-compute-0
+cluster-compute-1
+
+[hpc_control]
+cluster-control
+```
+
+```yaml
+#playbook.yml
+---
+- hosts: all
+  become: yes
+  tasks:
+    - import_role:
+        name: stackhpc.openhpc
+      vars:
+        openhpc_cluster_name: hpc
+        openhpc_enable:
+          control: "{{ inventory_hostname in groups['hpc_control'] }}"
+          batch: "{{ inventory_hostname in groups['hpc_compute'] }}"
+          runtime: true
+        openhpc_slurm_control_host: "{{ groups['hpc_control'] | first }}"
+        openhpc_nodegroups:
+          - name: compute
+        openhpc_partitions:
+          - name: compute
 ---
+```
+
+### Multiple nodegroups
+
+This example shows how partitions can span multiple types of compute node.
+
+This example inventory describes three types of compute node (login and
+control nodes are omitted for brevity):
+
+```ini
+# inventory/hosts:
+...
+[hpc_general]
+# standard compute nodes
+cluster-general-0
+cluster-general-1
+
+[hpc_large]
+# large memory nodes
+cluster-largemem-0
+cluster-largemem-1
+
+[hpc_gpu]
+# GPU nodes
+cluster-a100-0
+cluster-a100-1
+...
+```
+
+First, `openhpc_nodegroups` is set to capture these inventory groups and
+apply any node-level parameters - in this case the large memory nodes have
+two cores reserved for system use, and GRES is configured for the GPU nodes:
+
+```yaml
+openhpc_cluster_name: hpc
+openhpc_nodegroups:
+  - name: general
+  - name: large
+    node_params:
+      CoreSpecCount: 2
+  - name: gpu
+    gres:
+      - conf: gpu:A100:2
+        file: /dev/nvidia[0-1]
+```
+
+Now two partitions can be configured - a default one with a short time limit
+and no large memory nodes, for test jobs, and another with all hardware and a
+longer time limit for "production" jobs:
+
+```yaml
+openhpc_partitions:
+  - name: test
+    groups:
+      - general
+      - gpu
+    maxtime: '1:0:0' # 1 hour
+    default: 'YES'
+  - name: general
+    groups:
+      - general
+      - large
+      - gpu
+    maxtime: '2-0' # 2 days
+    default: 'NO'
+```
+Users select a partition using the `--partition` argument and request nodes
+with appropriate memory or GPUs using the `--mem`, `--gres` or `--gpus*`
+options to `sbatch` or `srun`.
+
+Finally, some additional configuration must be provided for GRES:
+```yaml
+openhpc_config:
+  GresTypes:
+    - gpu
+```
 
 <b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [↩](#accounting_storage)
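As a usage sketch for the simple example above: assuming the role is installed under the Galaxy name `stackhpc.openhpc` referenced in the playbook, and that the inventory and playbook are saved as `inventory/hosts` and `playbook.yml` as shown there, deployment is a standard `ansible-playbook` run:

```shell
# Install the role from Ansible Galaxy if it is not already present
ansible-galaxy role install stackhpc.openhpc

# Deploy the cluster using the example inventory and playbook
ansible-playbook -i inventory/hosts playbook.yml
```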

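To illustrate the submission options described in the multiple-nodegroups example, the sketch below reuses the partition names and the `gpu:A100` GRES type from that example; the time limit, memory size and `job.sh` script are placeholders only:

```shell
# Quick check on the default "test" partition (no --partition needed)
srun --time=0:10:0 hostname

# Production job targeting a large-memory node by requesting more memory
sbatch --partition=general --mem=500G job.sh

# Production job requesting two A100 GPUs on one node via GRES
sbatch --partition=general --gres=gpu:A100:2 job.sh
```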