docs/clusters/bristen.md: 5 additions & 5 deletions
@@ -44,9 +44,9 @@ Users are encouraged to use containers on Bristen.
## Running Jobs on Bristen
- ### SLURM
+ ### Slurm
- Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+ Bristen uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
There is currently a single slurm partition on the system:
@@ -58,7 +58,7 @@ There is currently a single slurm partition on the system:
|`normal`| 32 | - | 24 hours |
<!--
- See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+ See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
??? example "how to check the number of nodes on the system"
You can check the size of the system by running the following command in the terminal:
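The command itself is truncated out of this hunk; as a rough sketch, a standard Slurm query along these lines would report the node count (the exact command in the docs may differ):

```bash
# Count the nodes in the `normal` partition (a sketch using standard Slurm tooling).
sinfo --partition=normal --noheader --format="%D"
```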
@@ -78,7 +78,7 @@ Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://ap
### Scheduled Maintenance
- Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
+ Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).
@@ -87,4 +87,4 @@ Exceptional and non-disruptive updates may happen outside this time frame and wi
!!! change "2025-03-05 container engine updated"
now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.
docs/clusters/clariden.md: 3 additions & 3 deletions
@@ -67,9 +67,9 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy
## Running Jobs on Clariden
- ### SLURM
+ ### Slurm
- Clariden uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
+ Clariden uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
There are two slurm partitions on the system:
@@ -87,7 +87,7 @@ There are two slurm partitions on the system:
* nodes in the `xfer` partition can be shared
* nodes in the `debug` queue have a 1.5 node-hour time limit. This means you could for example request 2 nodes for 45 minutes each, or 1 single node for the full time limit.
- See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
+ See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
??? example "how to check the number of nodes on the system"
You can check the size of the system by running the following command in the terminal:
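As on Bristen, the exact command is truncated out of this hunk; a plausible sketch that also shows the per-partition breakdown described above (assuming standard Slurm tooling):

```bash
# List each partition with its node count (a sketch; partition names as described above).
sinfo --noheader --format="%P %D"
```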
This script has quite a bit more content to unpack.
We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where.
- Setting this up will be useful in the long run because it means we can tell SLURM how much hardware to reserve, and this script will setup all the details for us.
+ Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will setup all the details for us.
The cluster has four GH200 chips per compute node.
We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`.
Then, we calculate how many processes accelerate should launch.
We want to map each GPU to a separate process, so this should be four processes per node.
We multiply this by the number of nodes to obtain the total number of processes.
- Next, we use some bash magic to extract the name of the head node from SLURM environment variables.
+ Next, we use some bash magic to extract the name of the head node from Slurm environment variables.
Accelerate expects one main node and launches tasks on the other nodes from this main node.
Having sourced our python environment at the top of the script, we can then launch Gemma fine-tuning.
The first four lines of the launch line are used to configure accelerate.
Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma.
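The process-count arithmetic and head-node extraction described here are not shown in this hunk; a minimal sketch of that logic, assuming standard Slurm environment variables and `scontrol` (the variable names are illustrative, not taken from the tutorial script):

```bash
# Four GH200 GPUs per node; one process per GPU, multiplied across all nodes.
GPUS_PER_NODE=4
NUM_PROCESSES=$(( GPUS_PER_NODE * SLURM_NNODES ))

# "Bash magic": the first hostname in the job's node list becomes the accelerate main node.
MAIN_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
```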
- Next, we also need to create a short SLURM batch script to launch our fine-tuning script:
+ Next, we also need to create a short Slurm batch script to launch our fine-tuning script:
docs/guides/mlp_tutorials/llm-inference.md: 10 additions & 10 deletions
@@ -62,9 +62,9 @@ This step is straightforward, just make the file `$HOME/.config/containers/stora
mount_program = "/usr/bin/fuse-overlayfs-1.13"
```
- To build a container with Podman, we need to request a shell on a compute node from [SLURM][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
- SLURM is a workload manager which distributes workloads on the cluster.
- Through SLURM, many people can use the supercomputer at the same time without interfering with one another in any way:
+ To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
+ Slurm is a workload manager which distributes workloads on the cluster.
+ Through Slurm, many people can use the supercomputer at the same time without interfering with one another in any way:
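A hedged sketch of the workflow this hunk describes; the account, image, and output names are placeholders, and the `podman://` import scheme is assumed to be supported by the installed enroot version:

```bash
# Request an interactive shell on a compute node through Slurm (account is a placeholder).
srun -A <ACCOUNT> --pty bash

# On the compute node: build the image from the Dockerfile, then import it with enroot.
podman build -t my-image:latest .
enroot import -o my-image.sqsh podman://my-image:latest
```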
- At this point, you can exit the SLURM allocation again by typing `exit`.
- If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the SLURM job.
+ At this point, you can exit the Slurm allocation again by typing `exit`.
+ If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job.
Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container.
This is because the virtual environment ultimately relies on the resources packaged inside the container.
@@ -196,8 +196,8 @@ There's nothing wrong with this approach per se, but consider that you might be
You'll want to document how you're calling Slurm, what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take.
For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster.
- Create a SLURM batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
- The SLURM batch file should look like this:
+ Create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
+ The Slurm batch file should look like this:
```bash title="gemma-inference.sbatch"
#!/bin/bash
@@ -220,14 +220,14 @@ set -x
python ./gemma-inference.py
```
- The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the SLURM scheduler.
+ The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler.
You can see that one of these options is one we used previously to load our EDF file.
After this, we `cd` to our working directory, `source` our virtual environment and finally run our inference script.
As an alternative to using the `#SBATCH --environment=gemma-pytorch` option you can also run the code in the above script wrapped into an `srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "..."` statement.
The tutorial on nanotron e.g. uses this pattern in `run_tiny_llama.sh`.
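A sketch of that alternative pattern, reusing the virtual environment and script names from this tutorial (the account is a placeholder and the body of the wrapped command is an assumption):

```bash
# Run the inference step directly via srun instead of relying on #SBATCH --environment.
srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "
    source gemma-venv/bin/activate
    python ./gemma-inference.py
"
```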
- Once you've finished editing the batch file, you can save it and run it with SLURM:
+ Once you've finished editing the batch file, you can save it and run it with Slurm:
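The exact submission command is truncated out of this hunk; in general, submitting and monitoring such a job looks like this (standard Slurm commands, filenames as used above):

```bash
# Submit the batch file and check on the job in the queue.
sbatch gemma-inference.sbatch
squeue -u $USER
```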
If the application crashes or the job is killed by `slurm` prematurely, `jobreport` will not be able to write any output.
!!! warning "Too many GPUs reported by `jobreport`"
- If the job reporting utility reports more GPUs than you expect from the number of nodes requested by SLURM, you may be missing options to set the visible devices correctly for your job.
- See the [GH200 SLURM documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
+ If the job reporting utility reports more GPUs than you expect from the number of nodes requested by Slurm, you may be missing options to set the visible devices correctly for your job.
+ See the [GH200 Slurm documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
When oversubscribing ranks to GPUs, the utility will always report too many GPUs.
The utility does not combine data for the same GPU from different ranks.
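One common way to make the visible devices line up with the ranks is to bind one GPU per task; the authoritative recommendation is on the linked GH200 page, so treat this as an illustrative sketch only (application name and binding approach are assumptions):

```bash
# Sketch: four ranks per node, each seeing exactly one GPU via CUDA_VISIBLE_DEVICES.
srun --ntasks-per-node=4 --gpus-per-node=4 \
     bash -c 'export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID; exec ./my_app'
```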
docs/services/firecrest.md: 1 addition & 1 deletion
@@ -428,7 +428,7 @@ A staging area is used for external transfers and downloading/uploading a file f
```
!!! Note "Job submission through FirecREST"
- FirecREST provides an abstraction for job submission using in the backend the SLURM scheduler of the vCluster.
+ FirecREST provides an abstraction for job submission using in the backend the Slurm scheduler of the vCluster.
When submitting a job via the different [endpoints](https://firecrest.readthedocs.io/en/latest/reference.html#compute), you should pass the `-l` option to the `/bin/bash` command on the batch file.
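Concretely, passing `-l` means the batch script's interpreter line carries the flag, along these lines (the job options shown are placeholders for illustration):

```bash
#!/bin/bash -l
#SBATCH --job-name=firecrest-example
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# The -l flag makes bash run as a login shell so the usual profile/environment is loaded.
echo "Hello from $(hostname)"
```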
docs/software/communication/cray-mpich.md: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ This means that Cray MPICH will automatically be linked to the GTL library, whic
In addition to linking to the GTL library, Cray MPICH must be configured to be GPU-aware at runtime by setting the `MPICH_GPU_SUPPORT_ENABLED=1` environment variable.
On some CSCS systems this option is set by default.
- See [this page][ref-slurm-gh200] for more information on configuring SLURM to use GPUs.
+ See [this page][ref-slurm-gh200] for more information on configuring Slurm to use GPUs.
!!! warning "Segmentation faults when trying to communicate GPU buffers without `MPICH_GPU_SUPPORT_ENABLED=1`"
If you attempt to communicate GPU buffers through MPI without setting `MPICH_GPU_SUPPORT_ENABLED=1`, it will lead to segmentation faults, usually without any specific indication that it is the communication that fails.
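A minimal sketch of enabling the runtime option in a job script before launching the application (the application name is a placeholder):

```bash
# Enable GPU-aware communication in Cray MPICH at runtime, then launch the MPI application.
export MPICH_GPU_SUPPORT_ENABLED=1
srun ./my_gpu_mpi_app
```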
docs/software/ml/pytorch.md: 3 additions & 3 deletions
@@ -315,9 +315,9 @@ $ exit # (6)!
Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.
- ## Running PyTorch jobs with SLURM
+ ## Running PyTorch jobs with Slurm
- ```bash title="slurm sbatch script"
+ ```bash title="Slurm sbatch script"
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
@@ -383,7 +383,7 @@ srun bash -c "
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
- 9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
+ 9. `RANK` and `LOCAL_RANK` are set per-process by the Slurm job launcher.
10. Activate the virtual environment created on top of the uenv (if any).
Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
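For callout 9, the mapping from Slurm's per-process variables to the names `torch.distributed` expects is conventionally done like this (a sketch of the common convention, not a quotation from the full script):

```bash
# Map Slurm-provided per-process variables onto the environment variables PyTorch expects.
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NTASKS
```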