articles/cyclecloud/slurm.md (6 additions, 6 deletions)
@@ -16,7 +16,7 @@ Slurm is a highly configurable open source workload manager. For more informatio
> Starting with CycleCloud 8.4.0, the Slurm integration was rewritten to support new features and functionality. For more information, see [Slurm 3.0](slurm-3.md) documentation.
::: moniker range="=cyclecloud-7"
- To enable Slurm on a CycleCloud cluster, modify the "run_list" in the definiton of cluster's configuration section. A Slurm cluster has two main parts: the master (or scheduler) node, which runs the Slurm software on a shared file system, and the execute nodes, which mount that file system and run the submitted jobs. For example, a simple cluster template snippet may look like:
+ To enable Slurm on a CycleCloud cluster, modify the "run_list" in the definition of the cluster's configuration section. A Slurm cluster has two main parts: the master (or scheduler) node, which runs the Slurm software on a shared file system, and the execute nodes, which mount that file system and run the submitted jobs. For example, a simple cluster template snippet may look like:
```ini
[cluster custom-slurm]
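# Not part of the original snippet: an illustrative sketch of how the rest of this
# template might continue. The node names, image, machine type, and recipe names
# below are assumptions, not the exact values from the article.

[[node master]]
    ImageName = cycle.image.centos7
    MachineType = Standard_D4s_v3

    [[[configuration]]]
    # The Slurm recipes are applied through the run_list, as described above
    run_list = recipe[slurm::default], recipe[slurm::master]

[[nodearray execute]]
    ImageName = cycle.image.centos7
    MachineType = Standard_D2s_v3

    [[[configuration]]]
    run_list = recipe[slurm::default], recipe[slurm::execute]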
@@ -47,7 +47,7 @@ To enable Slurm on a CycleCloud cluster, modify the "run_list" in the definiton
::: moniker-end
::: moniker range=">=cyclecloud-8"
- Slurm can easily be enabled on a CycleCloud cluster by modifying the "run_list" in the configuration section of your cluster definition. The two basic components of a Slurm cluster are the 'scheduler' node which provides a shared filesystem on which the Slurm software runs, and the 'execute' nodes which are the hosts that mount the shared filesystem and execute the jobs submitted. For example, a simple cluster template snippet may look like:
+ Slurm can easily be enabled on a CycleCloud cluster by modifying the 'run_list', available in the configuration section of your cluster definition. The two basic components of a Slurm cluster are the 'scheduler' node which provides a shared filesystem on which the Slurm software runs, and the 'execute' nodes which are the hosts that mount the shared filesystem and execute the jobs submitted. For example, a simple cluster template snippet may look like:
```ini
[cluster custom-slurm]
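# Not part of the original snippet: an illustrative sketch of how the rest of this
# template might continue. The node names, image, machine type, and recipe names
# below are assumptions, not the exact values from the article.

[[node scheduler]]
    ImageName = cycle.image.centos7
    MachineType = Standard_D4s_v3

    [[[configuration]]]
    run_list = recipe[slurm::scheduler]

[[nodearray execute]]
    ImageName = cycle.image.centos7
    MachineType = Standard_D2s_v3

    [[[configuration]]]
    run_list = recipe[slurm::execute]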
@@ -92,12 +92,12 @@ The Slurm cluster deployed in CycleCloud contains a script that facilitates the
```
> [!NOTE]
- > For CycleCloud versions prior to 7.9.10, the `cyclecloud_slurm.sh` script is located in _/opt/cycle/jetpack/system/bootstrap/slurm_.
+ > For CycleCloud versions before 7.9.10, the `cyclecloud_slurm.sh` script is located in _/opt/cycle/jetpack/system/bootstrap/slurm_.
> [!IMPORTANT]
> If you make any changes that affect the VMs for nodes in an MPI partition (such as VM size, image, or cloud-init), the nodes **must** all be terminated first.
> The `remove_nodes` command prints a warning in this case, but it doesn't exit with an error.
- > If there're running nodes, you get an error of `This node doesn't match existing scaleset attribute` when new nodes are started.
+ > If there are running nodes, you get the error `This node doesn't match existing scaleset attribute` when new nodes are started.
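As an illustrative sketch (not from the original article), the sequence this note describes might look like the following, run as root on the scheduler node. The script directory and the `scale` subcommand are assumptions; only `remove_nodes` is named above, so check the script shipped with your CycleCloud version.

```bash
# Hypothetical sketch -- terminate the affected MPI-partition nodes first,
# then re-sync Slurm's node records from the scheduler node.
cd /opt/cycle/slurm || cd /opt/cycle/jetpack/system/bootstrap/slurm  # assumed locations; the latter applies before 7.9.10
./cyclecloud_slurm.sh remove_nodes   # removes the terminated nodes from Slurm's configuration
./cyclecloud_slurm.sh scale          # assumed subcommand to regenerate the Slurm node and partition definitions
```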
::: moniker-end
@@ -209,13 +209,13 @@ Add the following attributes to the `Configuration` section:
### Autoscale
- CycleCloud uses Slurm's [Elastic Computing](https://slurm.schedmd.com/elastic_computing.html) feature. To debug autoscale issues, there're a few logs on the scheduler node you can check. The first is making sure that the power save resume calls are being made by checking `/var/log/slurmctld/slurmctld.log`. You should see lines like:
+ CycleCloud uses Slurm's [Elastic Computing](https://slurm.schedmd.com/elastic_computing.html) feature. To debug autoscale issues, there are a few logs on the scheduler node you can check. The first is making sure that the power save resume calls are being made by checking `/var/log/slurmctld/slurmctld.log`. You should see lines like:
- The other log to check is `/var/log/slurmctld/resume.log`. If the resume step is failing, there's `/var/log/slurmctld/resume_fail.log`. If there're messages about unknown or invalid node names, make sure you haven't added nodes to the cluster without following the steps in the "Making Cluster Changes" section above.
+ The other log to check is `/var/log/slurmctld/resume.log`. If the resume step is failing, there's `/var/log/slurmctld/resume_fail.log`. If there are messages about unknown or invalid node names, ensure nodes aren't added to the cluster without following the steps in the "Making Cluster Changes" section above.
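As a quick illustrative check (not from the original article), the logs named above can be inspected on the scheduler node like this; the grep pattern is only an assumption about what the resume lines contain.

```bash
# Confirm that slurmctld is issuing power-save resume calls (pattern is illustrative)
sudo grep -i "resume" /var/log/slurmctld/slurmctld.log | tail -n 20

# Inspect the resume hook output, and any recorded failures
sudo tail -n 50 /var/log/slurmctld/resume.log
sudo tail -n 50 /var/log/slurmctld/resume_fail.log 2>/dev/null  # may not exist if no resume step has failed
```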