Commit 3bf7db9

committed
update rebuild docs
1 parent 38e5fd3 commit 3bf7db9

File tree

1 file changed

+9
-34
lines changed

docs/experimental/slurm-controlled-rebuild.md

Lines changed: 9 additions & 34 deletions
@@ -107,42 +107,17 @@ The configuration of this is complex and involves:
 defined in the `compute` or `login` variables, to override the default
 image for specific node groups.
 
-5. Modify `openhpc_slurm_partitions` to add a new partition covering rebuildable
-nodes to use for for rebuild jobs. If using the default OpenTofu
-configurations, this variable is contained in an OpenTofu-templated file
-`environments/$ENV/group_vars/all/partitions.yml` which must be overriden
-by copying it to e.g. a `z_partitions.yml` file in the same directory.
-However production sites will probably be overriding this file anyway to
-customise it.
-
-An example partition definition, given the two node groups "general" and
-"gpu" shown in Step 2, is:
-
-```yaml
-openhpc_slurm_partitions:
-    ...
-  - name: rebuild
-    groups:
-      - name: general
-      - name: gpu
-    default: NO
-    maxtime: 30
-    partition_params:
-      PriorityJobFactor: 65533
-      Hidden: YES
-      RootOnly: YES
-      DisableRootJobs: NO
-      PreemptMode: 'OFF'
-      OverSubscribe: EXCLUSIVE
-```
-
-Which has parameters as follows:
+5. Ensure `openhpc_partitions` contains a partition covering the nodes to run
+rebuild jobs. The default definition in `environments/common/inventory/group_vars/all/openhpc.yml`
+will automatically include this via `openhpc_rebuild_partition` also in that
+file. If modifying this, note the important parameters are:
+
 - `name`: Partition name matching `rebuild` role variable `rebuild_partitions`,
 default `rebuild`.
-- `groups`: A list of node group names, matching keys in the OpenTofu
-`compute` variable (see example in step 2 above). Normally every compute
-node group should be listed here, unless Slurm-controlled rebuild is not
-required for certain node groups.
+- `groups`: A list of nodegroup names, matching `openhpc_nodegroup` and
+keys in the OpenTofu `compute` variable (see example in step 2 above).
+Normally every compute node group should be listed here, unless
+Slurm-controlled rebuild is not required for certain node groups.
 - `default`: Must be set to `NO` so that it is not the default partition.
 - `maxtime`: Maximum time to allow for rebuild jobs, in
 [slurm.conf format](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
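
Since this commit removes the worked example, the following is a minimal sketch of what an explicit rebuild partition entry might look like under the new `openhpc_partitions` variable. It reuses the values from the removed example and the node groups "general" and "gpu" from step 2. The key names here are assumptions: `groups` as a plain list of names follows the parameter description above, and `partition_params` is carried over from the old `openhpc_slurm_partitions` example. Both should be checked against the default `openhpc_rebuild_partition` definition in `environments/common/inventory/group_vars/all/openhpc.yml` before use.

```yaml
# Sketch only, not the appliance's actual default definition.
# Assumes node groups "general" and "gpu" as in step 2.
openhpc_partitions:
  # ... partitions for normal user jobs go here ...
  - name: rebuild            # must match the rebuild role's `rebuild_partitions`
    groups:                  # assumed form: plain list of nodegroup names
      - general
      - gpu
    default: NO              # must not be the default partition
    maxtime: 30              # slurm.conf MaxTime format; a bare number is minutes
    partition_params:        # values copied from the example removed in this commit
      PriorityJobFactor: 65533
      Hidden: YES
      RootOnly: YES
      DisableRootJobs: NO
      PreemptMode: 'OFF'
      OverSubscribe: EXCLUSIVE
```

If the default definition already covers every compute node group, no override should be needed; a sketch like this would only matter when some node groups must be excluded from Slurm-controlled rebuild or when `maxtime` needs adjusting.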