10 changes: 5 additions & 5 deletions docs/clusters/bristen.md
@@ -44,9 +44,9 @@ Users are encouraged to use containers on Bristen.

## Running Jobs on Bristen

### SLURM
### Slurm

Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
Bristen uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

There is currently a single slurm partition on the system:

@@ -58,7 +58,7 @@ There is currently a single slurm partition on the system:
| `normal` | 32 | - | 24 hours |

<!--
See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

??? example "how to check the number of nodes on the system"
You can check the size of the system by running the following command in the terminal:
@@ -78,7 +78,7 @@ Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://ap

### Scheduled Maintenance

Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).

@@ -87,4 +87,4 @@ Exceptional and non-disruptive updates may happen outside this time frame and wi
!!! change "2025-03-05 container engine updated"
now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.

### Known issues
### Known issues
6 changes: 3 additions & 3 deletions docs/clusters/clariden.md
@@ -67,9 +67,9 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy

## Running Jobs on Clariden

### SLURM
### Slurm

Clariden uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
Clariden uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

There are two slurm partitions on the system:

@@ -87,7 +87,7 @@ There are two slurm partitions on the system:
* nodes in the `xfer` partition can be shared
* nodes in the `debug` queue have a 1.5 node-hour time limit. This means you could, for example, request 2 nodes for 45 minutes each, or a single node for the full 90 minutes.

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

??? example "how to check the number of nodes on the system"
You can check the size of the system by running the following command in the terminal:
8 changes: 4 additions & 4 deletions docs/clusters/santis.md
@@ -72,11 +72,11 @@ It is also possible to use HPC containers on Santis:

## Running jobs on Santis

### SLURM
### Slurm

Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
Santis uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

There are two [SLURM partitions][ref-slurm-partitions] on the system:
There are two [Slurm partitions][ref-slurm-partitions] on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
@@ -91,7 +91,7 @@ There are two [SLURM partitions][ref-slurm-partitions] on the system:
* nodes in the `normal` and `debug` partitions are not shared
* nodes in the `xfer` partition can be shared

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

### FirecREST

4 changes: 2 additions & 2 deletions docs/contributing/index.md
@@ -244,7 +244,7 @@ For adding information about a change, originally designed for recording updates

=== "Rendered"
!!! change "2025-04-17"
* SLURM was upgraded to version 25.1.
* Slurm was upgraded to version 25.1.
* uenv was upgraded to v0.8

Old changes can be folded:
@@ -256,7 +256,7 @@ For adding information about a change, originally designed for recording updates
=== "Markdown"
```
!!! change "2025-04-17"
* SLURM was upgraded to version 25.1.
* Slurm was upgraded to version 25.1.
* uenv was upgraded to v0.8
```

6 changes: 3 additions & 3 deletions docs/guides/mlp_tutorials/llm-finetuning.md
@@ -78,20 +78,20 @@ accelerate launch --config_file trl/examples/accelerate_configs/multi_gpu.yaml \

This script has quite a bit more content to unpack.
We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where.
Setting this up will be useful in the long run because it means we can tell SLURM how much hardware to reserve, and this script will setup all the details for us.
Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will set up all the details for us.

The cluster has four GH200 chips per compute node.
We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`.
Then, we calculate how many processes accelerate should launch.
We want to map each GPU to a separate process, so this should be four processes per node.
We multiply this by the number of nodes to obtain the total number of processes.
Next, we use some bash magic to extract the name of the head node from SLURM environment variables.
Next, we use some bash magic to extract the name of the head node from Slurm environment variables.
Accelerate expects one main node and launches tasks on the other nodes from this main node.
Having sourced our python environment at the top of the script, we can then launch Gemma fine-tuning.
The first four lines of the launch line are used to configure accelerate.
Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma.
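
As a rough sketch, the process count and head-node lookup described above could be derived from Slurm's environment along these lines (variable names are illustrative; the actual tutorial script may differ):

```bash
# Illustrative sketch only: compute the accelerate launch parameters from
# standard Slurm environment variables.
GPUS_PER_NODE=4
NUM_PROCESSES=$(( GPUS_PER_NODE * SLURM_NNODES ))                       # one process per GPU
MAIN_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)  # accelerate's main node
```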

Next, we also need to create a short SLURM batch script to launch our fine-tuning script:
Next, we also need to create a short Slurm batch script to launch our fine-tuning script:

```bash title="fine-tune-sft.sbatch"
#!/bin/bash
20 changes: 10 additions & 10 deletions docs/guides/mlp_tutorials/llm-inference.md
@@ -62,9 +62,9 @@ This step is straightforward, just make the file `$HOME/.config/containers/stora
mount_program = "/usr/bin/fuse-overlayfs-1.13"
```

To build a container with Podman, we need to request a shell on a compute node from [SLURM][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
SLURM is a workload manager which distributes workloads on the cluster.
Through SLURM, many people can use the supercomputer at the same time without interfering with one another in any way:
To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot.
Slurm is a workload manager which distributes workloads on the cluster.
Through Slurm, many people can use the supercomputer at the same time without interfering with one another in any way:

```console
$ srun -A <ACCOUNT> --pty bash
@@ -75,7 +75,7 @@ $ enroot import -x mount -o pytorch-24.01-py3-venv.sqsh podman://pytorch:24.01-p
```

where you should replace `<ACCOUNT>` with your project account ID.
At this point, you can exit the SLURM allocation by typing `exit`.
At this point, you can exit the Slurm allocation by typing `exit`.
You should be able to see a new squashfile next to your Dockerfile:

```console
@@ -161,8 +161,8 @@ $ pip install -U "huggingface_hub[cli]"
$ HF_HOME=$SCRATCH/huggingface huggingface-cli login
```

At this point, you can exit the SLURM allocation again by typing `exit`.
If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the SLURM job.
At this point, you can exit the Slurm allocation again by typing `exit`.
If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job.
Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container.
This is because the virtual environment ultimately relies on the resources packaged inside the container.

@@ -196,8 +196,8 @@ There's nothing wrong with this approach per se, but consider that you might be
You'll want to document how you're calling Slurm, what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take.
For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster.

Create a SLURM batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
The SLURM batch file should look like this:
Create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory.
The Slurm batch file should look like this:

```bash title="gemma-inference.sbatch"
#!/bin/bash
@@ -220,14 +220,14 @@ set -x
python ./gemma-inference.py
```

The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the SLURM scheduler.
The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler.
You can see that one of these options is one we used previously to load our EDF file.
After this, we `cd` to our working directory, `source` our virtual environment and finally run our inference script.

As an alternative to using the `#SBATCH --environment=gemma-pytorch` option you can also run the code in the above script wrapped into an `srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "..."` statement.
The nanotron tutorial, for example, uses this pattern in `run_tiny_llama.sh`.
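
A hedged sketch of that pattern (the directory and script names follow this tutorial's layout and are assumptions to adjust):

```bash
# Sketch only: run the same steps as the batch script inside a single srun call.
srun -A <ACCOUNT> -ul --environment=gemma-pytorch bash -c "
    cd \$SCRATCH/gemma-inference      # assumed working directory from this tutorial
    source ./gemma-venv/bin/activate
    python ./gemma-inference.py
"
```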

Once you've finished editing the batch file, you can save it and run it with SLURM:
Once you've finished editing the batch file, you can save it and run it with Slurm:

```console
$ sbatch ./gemma-inference.sbatch
12 changes: 6 additions & 6 deletions docs/running/jobreport.md
@@ -56,9 +56,9 @@ The report is divided into two parts: a general summary and GPU specific values.
| Field | Description |
| ----- | ----------- |
| Job Id | The Slurm job id |
| Step Id | The slurm step id. A job step in SLURM is a subdivision of a job started with srun |
| Step Id | The Slurm step id. A job step in Slurm is a subdivision of a job started with srun |
| User | The user account that submitted the job |
| SLURM Account | The project account that will be billed |
| Slurm Account | The project account that will be billed |
| Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
| Number of Nodes | The number of nodes allocated to the job |
| Number of GPUs | The number of GPUs allocated to the job |
@@ -95,7 +95,7 @@ Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| User | jpcoles |
+-----------------------------------------+-----------------------------------------+
| SLURM Account | unknown_account |
| Slurm Account | unknown_account |
+-----------------------------------------+-----------------------------------------+
| Start Time | 03-07-2024 15:32:24 |
+-----------------------------------------+-----------------------------------------+
@@ -134,8 +134,8 @@ GPU Specific Values
If the application crashes or the job is killed by `slurm` prematurely, `jobreport` will not be able to write any output.

!!! warning "Too many GPUs reported by `jobreport`"
If the job reporting utility reports more GPUs than you expect from the number of nodes requested by SLURM, you may be missing options to set the visible devices correctly for your job.
See the [GH200 SLURM documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
If the job reporting utility reports more GPUs than you expect from the number of nodes requested by Slurm, you may be missing options to set the visible devices correctly for your job.
See the [GH200 Slurm documentation][ref-slurm-gh200] for examples on how to expose GPUs correctly in your job.
When oversubscribing ranks to GPUs, the utility will always report too many GPUs.
The utility does not combine data for the same GPU from different ranks.
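
As an illustrative sketch only (the linked GH200 page has the recommended options for CSCS systems), one common way to give each rank exactly one visible GPU is:

```bash
# Sketch: one task per GPU so jobreport sees each device once (./my_app is a placeholder).
srun --ntasks-per-node=4 --gpus-per-task=1 ./my_app
```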

@@ -207,7 +207,7 @@ Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| User | jpcoles |
+-----------------------------------------+-----------------------------------------+
| SLURM Account | unknown_account |
| Slurm Account | unknown_account |
+-----------------------------------------+-----------------------------------------+
| Start Time | 03-07-2024 14:54:48 |
+-----------------------------------------+-----------------------------------------+
2 changes: 1 addition & 1 deletion docs/services/firecrest.md
@@ -428,7 +428,7 @@ A staging area is used for external transfers and downloading/uploading a file f
```
!!! Note "Job submission through FirecREST"

FirecREST provides an abstraction for job submission using in the backend the SLURM scheduler of the vCluster.
FirecREST provides an abstraction for job submission, using the Slurm scheduler of the vCluster in the backend.

When submitting a job via the different [endpoints](https://firecrest.readthedocs.io/en/latest/reference.html#compute), you should pass the `-l` option to the `/bin/bash` command on the batch file.
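
For example, a minimal batch file submitted through FirecREST might start as follows; only the `-l` flag on the first line is the point here, the other directives are placeholders:

```bash
#!/bin/bash -l
#SBATCH --job-name=firecrest-example
#SBATCH --nodes=1

srun hostname
```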

2 changes: 1 addition & 1 deletion docs/software/communication/cray-mpich.md
@@ -35,7 +35,7 @@ This means that Cray MPICH will automatically be linked to the GTL library, whic

In addition to linking to the GTL library, Cray MPICH must be configured to be GPU-aware at runtime by setting the `MPICH_GPU_SUPPORT_ENABLED=1` environment variable.
On some CSCS systems this option is set by default.
See [this page][ref-slurm-gh200] for more information on configuring SLURM to use GPUs.
See [this page][ref-slurm-gh200] for more information on configuring Slurm to use GPUs.
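
A minimal sketch of enabling this at runtime (the launch options and binary name are placeholders, not CSCS defaults):

```bash
# Enable GPU-aware communication in Cray MPICH before launching the MPI job.
export MPICH_GPU_SUPPORT_ENABLED=1
srun --ntasks-per-node=4 --gpus-per-node=4 ./my_mpi_app   # placeholder application
```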

!!! warning "Segmentation faults when trying to communicate GPU buffers without `MPICH_GPU_SUPPORT_ENABLED=1`"
If you attempt to communicate GPU buffers through MPI without setting `MPICH_GPU_SUPPORT_ENABLED=1`, it will lead to segmentation faults, usually without any specific indication that it is the communication that fails.
6 changes: 3 additions & 3 deletions docs/software/ml/pytorch.md
@@ -315,9 +315,9 @@ $ exit # (6)!
Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.

## Running PyTorch jobs with SLURM
## Running PyTorch jobs with Slurm

```bash title="slurm sbatch script"
```bash title="Slurm sbatch script"
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
@@ -383,7 +383,7 @@ srun bash -c "
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
9. `RANK` and `LOCAL_RANK` are set per-process by the Slurm job launcher (see the sketch after these notes).
10. Activate the virtual environment created on top of the uenv (if any).
Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
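
A minimal sketch tying notes 6 and 9 together (assumptions only, not the full recommended script; the standard Slurm per-process variables are assumed):

```bash
# Sketch: note 6 -- disable CUDA-awareness in MPICH to avoid deadlocks with NCCL.
export MPICH_GPU_SUPPORT_ENABLED=0
srun bash -c "
    # Note 9 -- derive the per-process ranks from standard Slurm variables.
    export RANK=\$SLURM_PROCID
    export LOCAL_RANK=\$SLURM_LOCALID
    python train.py   # placeholder for the actual training entry point
"
```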

6 changes: 3 additions & 3 deletions docs/software/sciapps/gromacs.md
@@ -66,11 +66,11 @@ Use `exit` to leave the user environment and return to the original shell.

### How to run

To start a job, 2 bash scripts are required: a standard SLURM submission script, and a [wrapper to start the CUDA MPS daemon][ref-slurm-gh200-single-rank-per-gpu] (in order to have multiple MPI ranks per GPU).
To start a job, 2 bash scripts are required: a standard Slurm submission script, and a [wrapper to start the CUDA MPS daemon][ref-slurm-gh200-single-rank-per-gpu] (in order to have multiple MPI ranks per GPU).

The wrapper script above needs to be made executable with `chmod +x mps-wrapper.sh`.

The SLURM submission script can be adapted from the template below to use the application and the `mps-wrapper.sh` in conjunction.
The Slurm submission script can be adapted from the template below to use the application and the `mps-wrapper.sh` in conjunction.

```bash title="launch.sbatch"
#!/bin/bash
@@ -106,7 +106,7 @@ This submission script is only representative. Users must run their input files
- Each node has 4 Grace CPUs and 4 Hopper GPUs. When running 8 MPI ranks (meaning two per CPU), keep in mind to not ask for more than 32 OpenMP threads per rank. That way no more than 64 threads will be running on a single CPU.
- Try running both 64 OMP threads x 1 MPI rank and 32 OMP threads x 2 MPI ranks configurations for the test problems and pick the one giving better performance (see the sketch after this list). When using multiple GPUs, the latter can be faster by 5-10%.
- `-update gpu` may not be possible for problems that require constraints on all atoms. In such cases, the update (integration) step will be performed on the CPU. This can lead to performance loss of at least 10% on a single GPU. Due to the overheads of additional data transfers on each step, this will also lead to lower scaling performance on multiple GPUs.
- When running on a single GPU, one can either configure the simulation with 1-2 MPI ranks with `-gpu_id` as `0`, or try running the simulation with a small number of parameters and let GROMACS run with defaults/inferred parameters with a command like the following in the SLURM script:
- When running on a single GPU, one can either configure the simulation with 1-2 MPI ranks with `-gpu_id` as `0`, or try running the simulation with a small number of parameters and let GROMACS run with defaults/inferred parameters with a command like the following in the Slurm script:
`srun ./mps-wrapper.sh -- gmx_mpi mdrun -s input.tpr -ntomp 64`
- Given the compute throughput of each Grace-Hopper module (single CPU+GPU), **for smaller-sized problems, it is possible that a single-GPU run is the fastest**. This may happen when the overheads of domain decomposition, communication and orchestration exceed the benefits of parallelism across multiple GPUs. In our test cases, a single Grace-Hopper module (1 CPU+GPU) has consistently shown a 6-8x performance speedup over a single node on Piz Daint (Intel Xeon Broadwell + P100).
- Try runs with and without specifying the GPU IDs explicitly with `-gpu_id 0123`. For the multi-node case, removing it might yield the best performance.
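
An illustrative single-node sketch of the "32 OpenMP threads x 2 MPI ranks per GPU" layout mentioned above (flags and the input file are assumptions to adapt, not a replacement for the `launch.sbatch` template):

```bash
# Sketch: 8 MPI ranks per node (2 per GPU), 32 OpenMP threads per rank.
export OMP_NUM_THREADS=32
srun --nodes=1 --ntasks-per-node=8 --cpus-per-task=32 ./mps-wrapper.sh -- \
    gmx_mpi mdrun -s input.tpr -ntomp 32
```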