articles/cyclecloud/slurm-3.md
Slurm scheduler support was rewritten as part of the CycleCloud 8.4.0 release. Key features include:
* Support for dynamic nodes and dynamic partitions via dynamic nodearrays, supporting both single and multiple virtual machine (VM) sizes
* New Slurm versions 23.02 and 22.05.8
* Cost reporting via `azslurm` CLI
* `azslurm` CLI-based autoscaler
* Ubuntu 20 support
* Removed the need for the topology plugin, and therefore also any submit plugin
## Slurm Clusters in CycleCloud versions < 8.4.0
For more information, see [Transitioning from 2.7 to 3.0](#transitioning-from-27-to-30).
### Making Cluster Changes
The Slurm cluster deployed in CycleCloud contains a CLI called `azslurm` to facilitate changes to the cluster. After you make changes, run the following command as root on the Slurm scheduler node:

```bash
# azslurm scale
```
The command creates the partitions with the correct number of nodes and the proper `gres.conf`, and restarts `slurmctld`.
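
To confirm the result, you can inspect the regenerated partitions with standard Slurm tooling; this is a generic check rather than anything CycleCloud-specific:

```bash
# Show each partition with its node count, CPUs, memory, and features
sinfo -o "%P %D %c %m %f"
```
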
### No longer precreating execute nodes
Starting with version 3.0.0 of the CycleCloud Slurm project, nodes aren't precreated. Nodes are created when `azslurm resume` is invoked, or by manually creating them in CycleCloud using the CLI.
### Creating extra partitions
The default template that ships with Azure CycleCloud has three partitions (`hpc`, `htc` and `dynamic`), and you can define custom nodearrays that map directly to Slurm partitions. For example, to create a GPU partition, add the following section to your cluster template:
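
The full template section isn't shown in this excerpt, so the following is a minimal sketch of what a GPU nodearray might look like. The VM size, core limit, and cluster-init project version are illustrative and should be adjusted to match your own template:

```ini
[[nodearray gpu]]
    # Illustrative VM size and limit; pick values that fit your workload
    MachineType = Standard_NC24ads_A100_v4
    MaxCoreCount = 96

        [[[configuration]]]
        slurm.autoscale = true
        # Set slurm.hpc = true only for tightly coupled multi-node jobs that
        # need a single placement group
        slurm.hpc = false

        [[[cluster-init cyclecloud/slurm:execute:3.0.1]]]
```
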
### Dynamic Partitions
Starting with CycleCloud version 3.0.1, we support dynamic partitions. You can make a `nodearray` map to a dynamic partition by adding the following configuration. `myfeature` can be any desired feature description, or more than one feature separated by commas.
```ini
    [[[configuration]]]
    # The setting below is reconstructed as an assumption from the surrounding text:
    # it passes slurmd the dynamic-node flag and the feature list. Replace myfeature
    # with one feature name, or several separated by commas.
    slurm.dynamic_config := "-Z --conf \"Feature=myfeature\""
```

By default, the dynamic partition doesn't include any nodes. You can start nodes through CycleCloud or by running `azslurm resume` manually, and they join the cluster using the name you choose. However, since Slurm isn't aware of these nodes ahead of time, it can't autoscale them up.
Instead, you can also precreate node records like so, which allows Slurm to autoscale them up.
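
For example, here's a hedged sketch of precreating dynamic node records with `scontrol` (node names, CPU count, and feature are illustrative, and the feature should match the one configured for the dynamic nodearray):

```bash
# Precreate cloud node records so Slurm can scale them up on demand
scontrol create NodeName=dyn-[1-4] CPUs=4 Feature=myfeature State=CLOUD
```
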
Either way, once you create these nodes with `State=Cloud`, they become available for autoscaling like other nodes.
To support **multiple VM sizes in a CycleCloud nodearray**, alter the template by adding `Config.Multiselect = true`, as in the following excerpt:
```ini
    [[[parameter DynamicMachineType]]]
    # The parameter name above is illustrative; only the Config.Multiselect line
    # appears in this excerpt. Cloud.MachineType parameters define the VM size picker.
    ParameterType = Cloud.MachineType
    Config.Multiselect = true
```
### Dynamic Scale down
By default, all nodes in the dynamic partition scale down just like the other partitions. To disable scale down for the dynamic partition, see [SuspendExcParts](https://slurm.schedmd.com/slurm.conf.html).
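
For instance, assuming your dynamic partition is named `dynamic`, the corresponding `slurm.conf` entry would look like this:

```ini
# Exclude the dynamic partition from automatic suspend (scale down)
SuspendExcParts=dynamic
```
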
### Manual scaling
If cyclecloud_slurm detects that autoscale is disabled (`SuspendTime=-1`), it uses the FUTURE state to denote nodes that are powered down instead of relying on the power state in Slurm. That is, when autoscale is enabled, off nodes are denoted as `idle~` in `sinfo`. When autoscale is disabled, the off nodes don't appear in `sinfo` at all. You can still see their definition with `scontrol show nodes --future`.
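
For example:

```bash
# With autoscale enabled, powered-down nodes show up as idle~ in sinfo
sinfo

# With autoscale disabled, powered-down (FUTURE) nodes are hidden from sinfo;
# list their definitions explicitly instead
scontrol show nodes --future
```
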
To start new nodes, run `/opt/azurehpc/slurm/resume_program.sh node_list` (for example, `htc-[1-10]`).
To shut down nodes, run `/opt/azurehpc/slurm/suspend_program.sh node_list` (for example, `htc-[1-10]`).
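
For example, to bring up ten `htc` nodes and shut them down again later:

```bash
# Start nodes htc-1 through htc-10
/opt/azurehpc/slurm/resume_program.sh htc-[1-10]

# Shut the same nodes down once they're no longer needed
/opt/azurehpc/slurm/suspend_program.sh htc-[1-10]
```
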
To start a cluster in this mode, add `SuspendTime=-1` to the additional Slurm config in the template.
To switch a cluster to this mode, add `SuspendTime=-1` to the `slurm.conf` and run `scontrol reconfigure`. Then run `azslurm remove_nodes && azslurm scale`.
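
A short sketch of the switch, run as root on the scheduler node:

```bash
# After adding SuspendTime=-1 to slurm.conf, apply the change
scontrol reconfigure

# Remove the existing node records and regenerate them for manual scaling
azslurm remove_nodes && azslurm scale
```
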
## Transitioning from 2.7 to 3.0

1. The installation folder changed to `/opt/azurehpc/slurm`.
2. Autoscale logs are now in `/opt/azurehpc/slurm/logs` instead of `/var/log/slurmctld`. Note that `slurmctld.log` is in this folder.
3. The `cyclecloud_slurm.sh` script is no longer available. It's replaced by a new CLI tool called `azslurm`, which you can run as root. `azslurm` also supports autocomplete.

    ```bash
    [root@scheduler ~]# azslurm
    ...
    ```

5. CycleCloud no longer creates nodes ahead of time. It only creates them when they're needed.
6. All Slurm binaries are inside the `azure-slurm-install-pkg*.tar.gz` file, under `slurm-pkgs`. They're pulled from a specific binary release. The current binary release is [4.0.0](https://github.com/Azure/cyclecloud-slurm/releases/tag/4.0.0).
7. For MPI jobs, the only default network boundary is the partition. Unlike version 2.x, each partition doesn't include multiple "placement groups", so you only have one colocated VMSS per partition. There's no need for the topology plugin anymore, so the job submission plugin isn't needed either. Instead, submitting to multiple partitions is the recommended option for use cases that require submitting jobs to multiple placement groups, as in the sketch below.
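
For example, a job that can run in any of several placement-group-backed partitions can list them all at submission time (partition names here are illustrative):

```bash
# Slurm starts the job in whichever listed partition can run it first
sbatch --partition=hpc,hpc2 my_mpi_job.sh
```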
0 commit comments