
Commit 87bd8ff

SLURM -> Slurm (#173)
#148 suggested renaming SLURM -> Slurm; this accompanying PR applies the change on every other page for consistency.
1 parent 5acaec6 commit 87bd8ff

14 files changed, with 53 additions and 53 deletions.


docs/clusters/bristen.md

Lines changed: 5 additions & 5 deletions
@@ -44,9 +44,9 @@ Users are encouraged to use containers on Bristen.
 
 ## Running Jobs on Bristen
 
-### SLURM
+### Slurm
 
-Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+Bristen uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 
 There is currently a single slurm partition on the system:
 
@@ -58,7 +58,7 @@ There is currently a single slurm partition on the system:
 | `normal` | 32 | - | 24 hours |
 
 <!--
-See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
 
 ??? example "how to check the number of nodes on the system"
     You can check the size of the system by running the following command in the terminal:
@@ -78,7 +78,7 @@ Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://ap
 
 ### Scheduled Maintenance
 
-Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
+Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
 
 Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).
 
@@ -87,4 +87,4 @@ Exceptional and non-disruptive updates may happen outside this time frame and wi
 !!! change "2025-03-05 container engine updated"
     now supports better containers that go faster. Users do not to change their workflow to take advantage of these updates.
 
-### Known issues
+### Known issues
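
The truncated example admonition in this diff refers to checking the machine size; a minimal sketch of such a check with standard Slurm tooling (the `sinfo` format string is an assumption, not taken from the Bristen docs):

```bash
# Query the single `normal` partition described above: partition name,
# number of nodes, and time limit (run on a Bristen login node).
sinfo --partition=normal --format="%P %D %l"
```

Based on the partition table shown in the hunk, this would be expected to report 32 nodes with a 24-hour limit.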

docs/clusters/clariden.md

Lines changed: 3 additions & 3 deletions
@@ -67,9 +67,9 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy
 
 ## Running Jobs on Clariden
 
-### SLURM
+### Slurm
 
-Clariden uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+Clariden uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 
 There are two slurm partitions on the system:
 
@@ -87,7 +87,7 @@ There are two slurm partitions on the system:
 * nodes in the `xfer` partition can be shared
 * nodes in the `debug` queue have a 1.5 node-hour time limit. This means you could for example request 2 nodes for 45 minutes each, or 1 single node for the full time limit.
 
-See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
 
 ??? example "how to check the number of nodes on the system"
     You can check the size of the system by running the following command in the terminal:
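
The 1.5 node-hour limit on the `debug` queue described above maps directly onto the `--nodes`/`--time` combination you request; a sketch of the two examples given in the text, using standard Slurm options (account and other site-specific flags omitted):

```bash
# 2 nodes x 45 minutes = 1.5 node-hours, the debug queue's budget ...
srun --partition=debug --nodes=2 --time=00:45:00 --pty bash

# ... or 1 node for the full 90 minutes.
srun --partition=debug --nodes=1 --time=01:30:00 --pty bash
```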

docs/clusters/santis.md

Lines changed: 4 additions & 4 deletions
@@ -72,11 +72,11 @@ It is also possible to use HPC containers on Santis:
 
 ## Running jobs on Santis
 
-### SLURM
+### Slurm
 
-Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+Santis uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 
-There are two [SLURM partitions][ref-slurm-partitions] on the system:
+There are two [Slurm partitions][ref-slurm-partitions] on the system:
 
 * the `normal` partition is for all production workloads.
 * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
@@ -91,7 +91,7 @@ There are two [SLURM partitions][ref-slurm-partitions] on the system:
 * nodes in the `normal` and `debug` partitions are not shared
 * nodes in the `xfer` partition can be shared
 
-See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
 
 ### FirecREST
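
To make the partition distinction above concrete, here is a minimal sbatch header sketch; the 30-minute value comes from the `debug` description in the hunk, everything else is a placeholder:

```bash
#!/bin/bash
#SBATCH --partition=debug     # use 'normal' for production workloads
#SBATCH --time=00:30:00       # debug allocations are capped at 30 minutes
#SBATCH --nodes=1
#SBATCH --account=<ACCOUNT>   # replace with your project account

srun hostname
```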

docs/contributing/index.md

Lines changed: 2 additions & 2 deletions
@@ -244,7 +244,7 @@ For adding information about a change, originally designed for recording updates
 
 === "Rendered"
     !!! change "2025-04-17"
-        * SLURM was upgraded to version 25.1.
+        * Slurm was upgraded to version 25.1.
         * uenv was upgraded to v0.8
 
 Old changes can be folded:
@@ -256,7 +256,7 @@ For adding information about a change, originally designed for recording updates
 === "Markdown"
     ```
     !!! change "2025-04-17"
-        * SLURM was upgraded to version 25.1.
+        * Slurm was upgraded to version 25.1.
         * uenv was upgraded to v0.8
     ```

docs/guides/mlp_tutorials/llm-finetuning.md

Lines changed: 3 additions & 3 deletions
@@ -78,20 +78,20 @@ accelerate launch --config_file trl/examples/accelerate_configs/multi_gpu.yaml \
 
 This script has quite a bit more content to unpack.
 We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where.
-Setting this up will be useful in the long run because it means we can tell SLURM how much hardware to reserve, and this script will setup all the details for us.
+Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will setup all the details for us.
 
 The cluster has four GH200 chips per compute node.
 We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`.
 Then, we calculate how many processes accelerate should launch.
 We want to map each GPU to a separate process, this should be four processes per node.
 We multiply this by the number of nodes to obtain the total number of processes.
-Next, we use some bash magic to extract the name of the head node from SLURM environment variables.
+Next, we use some bash magic to extract the name of the head node from Slurm environment variables.
 Accelerate expects one main node and launches tasks on the other nodes from this main node.
 Having sourced our python environment at the top of the script, we can then launch Gemma fine-tuning.
 The first four lines of the launch line are used to configure accelerate.
 Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma.
 
-Next, we also need to create a short SLURM batch script to launch our fine-tuning script:
+Next, we also need to create a short Slurm batch script to launch our fine-tuning script:
 
 ```bash title="fine-tune-sft.sbatch"
 #!/bin/bash
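
The "bash magic" for deriving the head node and process count from Slurm environment variables is not shown in this hunk; a sketch of the usual pattern (variable names are illustrative, not necessarily those used in the tutorial's script):

```bash
# Four GH200 GPUs per node -> one accelerate process per GPU.
NUM_PROCESSES=$(( SLURM_NNODES * 4 ))

# The first hostname in the allocation acts as the accelerate main node.
MAIN_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

echo "launching ${NUM_PROCESSES} processes, main node: ${MAIN_NODE}"
```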

docs/guides/mlp_tutorials/llm-inference.md

Lines changed: 10 additions & 10 deletions
@@ -62,9 +62,9 @@ This step is straightforward, just make the file `$HOME/.config/containers/stora
 mount_program = "/usr/bin/fuse-overlayfs-1.13"
 ```
 
-To build a container with Podman, we need to request a shell on a compute node from [SLURM][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
-SLURM is a workload manager which distributes workloads on the cluster.
-Through SLURM, many people can use the supercomputer at the same time without interfering with one another in any way:
+To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
+Slurm is a workload manager which distributes workloads on the cluster.
+Through Slurm, many people can use the supercomputer at the same time without interfering with one another in any way:
 
 ```console
 $ srun -A <ACCOUNT> --pty bash
@@ -75,7 +75,7 @@ $ enroot import -x mount -o pytorch-24.01-py3-venv.sqsh podman://pytorch:24.01-p
 ```
 
 where you should replace `<ACCOUNT>` with your project account ID.
-At this point, you can exit the SLURM allocation by typing `exit`.
+At this point, you can exit the Slurm allocation by typing `exit`.
 You should be able to see a new squashfile next to your Dockerfile:
 
 ```console
@@ -161,8 +161,8 @@ $ pip install -U "huggingface_hub[cli]"
 $ HF_HOME=$SCRATCH/huggingface huggingface-cli login
 ```
 
-At this point, you can exit the SLURM allocation again by typing `exit`.
-If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the SLURM job.
+At this point, you can exit the Slurm allocation again by typing `exit`.
+If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job.
 Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container.
 This is because the virtual environment ultimately relies on the resources packaged inside the container.
 
@@ -196,8 +196,8 @@ There's nothing wrong with this approach per se, but consider that you might be
 You'll want to document how you're calling Slurm, what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take.
 For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster.
 
-Create a SLURM batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
-The SLURM batch file should look like this:
+Create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
+The Slurm batch file should look like this:
 
 ```bash title="gemma-inference.sbatch"
 #!/bin/bash
@@ -220,14 +220,14 @@ set -x
 python ./gemma-inference.py
 ```
 
-The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the SLURM scheduler.
+The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler.
 You can see that one of these options is one we used previously to load our EDF file.
 After this, we `cd` to our working directory, `source` our virtual environment and finally run our inference script.
 
 As an alternative to using the `#SBATCH --environment=gemma-pytorch` option you can also run the code in the above script wrapped into an `srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "..."` statement.
 The tutorial on nanotron e.g. uses this pattern in `run_tiny_llama.sh`.
 
-Once you've finished editing the batch file, you can save it and run it with SLURM:
+Once you've finished editing the batch file, you can save it and run it with Slurm:
 
 ```console
 $ sbatch ./gemma-inference.sbatch
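
As a complement to the `#SBATCH --environment=gemma-pytorch` approach discussed in this file, a sketch of the srun-wrapper alternative the text mentions; the working directory and virtual-environment path are assumptions based on the tutorial's folder names, and `<ACCOUNT>` stays a placeholder:

```bash
srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "
    cd \$SCRATCH/gemma-inference       # assumed location of the tutorial files
    source ./gemma-venv/bin/activate   # virtual environment created earlier
    python ./gemma-inference.py
"
```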

docs/running/jobreport.md

Lines changed: 6 additions & 6 deletions
@@ -56,9 +56,9 @@ The report is divided into two parts: a general summary and GPU specific values.
 | Field | Description |
 | ----- | ----------- |
 | Job Id | The Slurm job id |
-| Step Id | The slurm step id. A job step in SLURM is a subdivision of a job started with srun |
+| Step Id | The slurm step id. A job step in Slurm is a subdivision of a job started with srun |
 | User | The user account that submitted the job |
-| SLURM Account | The project account that will be billed |
+| Slurm Account | The project account that will be billed |
 | Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
 | Number of Nodes | The number of nodes allocated to the job |
 | Number of GPUs | The number of GPUs allocated to the job |
@@ -95,7 +95,7 @@ Summary of Job Statistics
 +-----------------------------------------+-----------------------------------------+
 | User | jpcoles |
 +-----------------------------------------+-----------------------------------------+
-| SLURM Account | unknown_account |
+| Slurm Account | unknown_account |
 +-----------------------------------------+-----------------------------------------+
 | Start Time | 03-07-2024 15:32:24 |
 +-----------------------------------------+-----------------------------------------+
@@ -134,8 +134,8 @@ GPU Specific Values
 If the application crashes or the job is killed by `slurm` prematurely, `jobreport` will not be able to write any output.
 
 !!! warning "Too many GPUs reported by `jobreport`"
-    If the job reporting utility reports more GPUs than you expect from the number of nodes requested by SLURM, you may be missing options to set the visible devices correctly for your job.
-    See the [GH200 SLURM documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
+    If the job reporting utility reports more GPUs than you expect from the number of nodes requested by Slurm, you may be missing options to set the visible devices correctly for your job.
+    See the [GH200 Slurm documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
     When oversubscribing ranks to GPUs, the utility will always report too many GPUs.
     The utility does not combine data for the same GPU from different ranks.
 
@@ -207,7 +207,7 @@ Summary of Job Statistics
 +-----------------------------------------+-----------------------------------------+
 | User | jpcoles |
 +-----------------------------------------+-----------------------------------------+
-| SLURM Account | unknown_account |
+| Slurm Account | unknown_account |
 +-----------------------------------------+-----------------------------------------+
 | Start Time | 03-07-2024 14:54:48 |
 +-----------------------------------------+-----------------------------------------+
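
The warning above about `jobreport` counting too many GPUs comes down to GPU visibility per rank; one common way to pin a single GPU to each rank is sketched below. This is a generic Slurm/CUDA pattern, not taken from the referenced GH200 page, and the application name is a placeholder:

```bash
# Give each of the four ranks per node its own GPU, so per-GPU accounting
# sees exactly one device per rank.
srun --ntasks-per-node=4 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_LOCALID exec ./my_app'
```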

docs/services/firecrest.md

Lines changed: 1 addition & 1 deletion
@@ -428,7 +428,7 @@ A staging area is used for external transfers and downloading/uploading a file f
 ```
 !!! Note "Job submission through FirecREST"
 
-    FirecREST provides an abstraction for job submission using in the backend the SLURM scheduler of the vCluster.
+    FirecREST provides an abstraction for job submission using in the backend the Slurm scheduler of the vCluster.
 
     When submitting a job via the different [endpoints](https://firecrest.readthedocs.io/en/latest/reference.html#compute), you should pass the `-l` option to the `/bin/bash` command on the batch file.
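
The note above asks for the `-l` option on the `/bin/bash` command of batch files submitted through FirecREST; a minimal sketch of such a script, where all `#SBATCH` values are placeholders:

```bash
#!/bin/bash -l
#SBATCH --job-name=firecrest-example
#SBATCH --nodes=1
#SBATCH --time=00:10:00

srun hostname
```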

docs/software/communication/cray-mpich.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ This means that Cray MPICH will automatically be linked to the GTL library, whic
 
 In addition to linking to the GTL library, Cray MPICH must be configured to be GPU-aware at runtime by setting the `MPICH_GPU_SUPPORT_ENABLED=1` environment variable.
 On some CSCS systems this option is set by default.
-See [this page][ref-slurm-gh200] for more information on configuring SLURM to use GPUs.
+See [this page][ref-slurm-gh200] for more information on configuring Slurm to use GPUs.
 
 !!! warning "Segmentation faults when trying to communicate GPU buffers without `MPICH_GPU_SUPPORT_ENABLED=1`"
     If you attempt to communicate GPU buffers through MPI without setting `MPICH_GPU_SUPPORT_ENABLED=1`, it will lead to segmentation faults, usually without any specific indication that it is the communication that fails.
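
A minimal sketch of enabling GPU-aware Cray MPICH at runtime, as the paragraph above describes; the application name and resource options are placeholders:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Required at runtime for communicating GPU buffers through Cray MPICH;
# omitting it typically leads to the segmentation faults described above.
export MPICH_GPU_SUPPORT_ENABLED=1

srun ./my_mpi_gpu_app
```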

docs/software/ml/pytorch.md

Lines changed: 3 additions & 3 deletions
@@ -315,9 +315,9 @@ $ exit # (6)!
 Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to to add both Python and non-Python packages.
 However, this workflow is more involved and intended for advanced Spack users.
 
-## Running PyTorch jobs with SLURM
+## Running PyTorch jobs with Slurm
 
-```bash title="slurm sbatch script"
+```bash title="Slurm sbatch script"
 #!/bin/bash
 #SBATCH --job-name=myjob
 #SBATCH --nodes=1
@@ -383,7 +383,7 @@ srun bash -c "
 6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl.
 7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
 8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
-9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
+9. `RANK` and `LOCAL_RANK` are set per-process by the Slurm job launcher.
 10. Activate the virtual environment created on top of the uenv (if any).
     Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
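
Annotation 9 above notes that `RANK` and `LOCAL_RANK` are set per process by the Slurm job launcher; a sketch of how that mapping is commonly wired up from Slurm's own variables (illustrative, not the exact lines of the sbatch script in this file):

```bash
# Inside the srun'd command: derive the PyTorch distributed variables
# from what Slurm sets for each task.
export RANK=$SLURM_PROCID          # global rank of this task
export LOCAL_RANK=$SLURM_LOCALID   # rank within the node
export WORLD_SIZE=$SLURM_NTASKS    # total number of tasks
```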
