51 changes: 40 additions & 11 deletions docs/alps/hardware.md
@@ -40,13 +40,13 @@

There are currently five node types in Alps:

| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
| ---- | ------- | ------:| -----:| -----------:| -----------:|
| [NVIDIA GH200][ref-alps-gh200-node] | gh200 | 1344 | 2688 | 10,752 | 10,752 |
| [AMD Rome][ref-alps-zen2-node] | zen2 | 256 | 1024 | 2,048 | -- |
| [NVIDIA A100][ref-alps-a100-node] | a100 | 72 | 144 | 144 | 576 |
| [AMD MI250x][ref-alps-mi200-node] | mi200 | 12 | 24 | 24 | 96 |
| [AMD MI300A][ref-alps-mi300-node] | mi300 | 64 | 128 | 512 | 512 |

[](){#ref-alps-gh200-node}
### NVIDIA GH200 GPU Nodes
@@ -80,16 +80,45 @@
[](){#ref-alps-zen2-node}
### AMD Rome CPU Nodes

These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, are packaged four nodes per HPE Cray EX425 blade, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:

* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs.
* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs.

!!! note "Not all memory is available"
The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard and large memory nodes respectively.

The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.
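If a job needs more memory than a standard node provides, one approach (a sketch -- whether this selects a large-memory node depends on how memory limits are configured in Slurm on the cluster) is to request the memory explicitly at submission, so that the job can only be placed on nodes with enough memory:

```console title="Requesting more memory than a standard node provides (sketch)"
$ sbatch --mem=400G job.sh
```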

A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
{.annotate}

1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.

![Screenshot](../images/slurm/eiger-topo.png)

* The two sockets are labelled Package L#0 and Package L#1.
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, presented as two processing units (PU) per physical core.

* The first PU on each core is numbered 0:63 on socket 0, and 64:127 on socket 1.
* The second PU on each core is numbered 128:191 on socket 0, and 192:255 on socket 1.
* Hence core `n` hosts PUs `n` and `n+128`.


Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.

[](){#ref-alps-a100-node}
### NVIDIA A100 GPU Nodes

The Grizzly Peak blades contain two nodes, where each node has:

* One 64-core Zen3 CPU socket
* 512 GB DDR4 Memory
* 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
* 4 NICs -- one per GPU.

The nodes in the MCH system are the same, except that their A100 GPUs have 96 GB of memory each.

[](){#ref-alps-mi200-node}
### AMD MI250x GPU Nodes
Binary file added docs/images/slurm/eiger-topo.png
218 changes: 206 additions & 12 deletions docs/running/slurm.md
@@ -4,17 +4,67 @@
CSCS uses [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster. It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.

## Accounting
## Accounts and resources

Slurm associates each job with a CSCS project in order to perform accounting.
The project to use for accounting is specified using the `--account/-A` flag.
If no account is specified, the primary project is used as the default.

??? example "Which projects am I a member of?"
Users are often members of multiple projects, and by extension of their associated `group_id` groups.
You can get a list of your groups using the `id` command in the terminal:
```console
$ id $USER
uid=12345(bobsmith) gid=32819(g152) groups=32819(g152),33119(g174),32336(vasp6)
```
Here the user `bobsmith` is in three projects (`g152`, `g174` and `vasp6`), with the project `g152` being their **primary project**.

??? example "What is my primary project?"
In the terminal, use the following command to find your **primary group**:
```console
$ id -gn $USER
g152
```

```console title="Specifying the account on the command line"
$ srun -A g123 -n4 -N1 ./run
$ srun --account=g123 -n4 -N1 ./run
$ sbatch --account=g123 ./job.sh
```

```bash title="Specifying the account in an sbatch script"
#!/bin/bash

#SBATCH --account=g123
#SBATCH --job-name=example-%j
#SBATCH --time=00:30:00
#SBATCH --nodes=4
...
```
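The account can also be set using Slurm's standard input environment variables, so that it does not have to be repeated for every submission. A minimal sketch, assuming the project `g123` (replace with your own project):

```bash title="Setting a default account via environment variables"
# for example in ~/.bashrc
export SBATCH_ACCOUNT=g123   # used by sbatch
export SLURM_ACCOUNT=g123    # used by srun
export SALLOC_ACCOUNT=g123   # used by salloc
```

Command-line flags take precedence over these variables.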

!!! note
The `--account` and `-Cmc` flags that were required on the old Eiger cluster are no longer needed.

## Prioritization and scheduling

Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
An aging factor is also applied to each job in the queue to ensure fairness over time.

Since users from various projects are continuously submitting jobs, the relative priority of jobs is dynamic and may change frequently.
As a result, estimated start times are approximate and subject to change based on new job submissions.

Additionally, short-duration jobs may be selected for backfilling — a process where the scheduler fills in available time slots while preparing to run a larger, higher-priority job.
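Standard Slurm commands can be used to see where your jobs stand in the queue; for example (a sketch -- the exact output depends on the Slurm version and configuration):

```console title="Inspecting estimated start times and job priority"
$ squeue --me --start   # estimated start times of your pending jobs
$ sprio -u $USER        # breakdown of the priority factors of your pending jobs
```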

[](){#ref-slurm-partitions}
## Partitions

At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.

Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs.
For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.

!!! example "How to check the partitions and number of nodes therein?"
You can check the size of the system by running the following command in the terminal:
@@ -27,7 +77,6 @@
```
The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).


[](){#ref-slurm-partition-debug}
### Debug partition
The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (the time limit can be seen with `sinfo -p debug`) and a low maximum node count (the `MaxNodes` value can be seen with `scontrol show partition=debug`).
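For example, a small test can be sent to the debug partition by selecting the partition explicitly (here `job.sh` and `./test.exe` are placeholders for your own script and application):

```console title="Submitting to the debug partition"
$ srun --partition=debug -N1 -n4 -t 00:10:00 ./test.exe
$ sbatch --partition=debug job.sh
```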
@@ -38,6 +87,116 @@

The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.

## Affinity

The following sections document how to use Slurm on the different compute node types available on Alps.
To demonstrate the effect of different Slurm parameters, we use a small command line tool, [affinity](https://github.com/bcumming/affinity), that prints the CPU cores and GPUs assigned to each MPI rank in a job, and the node on which each rank runs.

We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.

It is straightforward to build the affinity tool to experiment with Slurm configurations.

```console title="Compiling affinity"
$ uenv start prgenv-gnu/24.11:v2 --view=default #(1)
$ git clone https://github.com/bcumming/affinity.git
$ cd affinity; mkdir build; cd build;
$ CC=gcc CXX=g++ cmake .. #(2)
$ CC=gcc CXX=g++ cmake .. -DAFFINITY_GPU=cuda #(3)
$ CC=gcc CXX=g++ cmake .. -DAFFINITY_GPU=rocm #(4)
```

1. Affinity can be built using [`prgenv-gnu`][ref-uenv-prgenv-gnu] on all clusters.

2. By default affinity will build with MPI support and no GPU support: configure with no additional arguments on a CPU-only system like [Eiger][ref-cluster-eiger].

3. Enable CUDA support on systems that provide NVIDIA GPUs.

4. Enable ROCM support on systems that provide AMD GPUs.

The build generates the following executables:

* `affinity.omp`: tests thread affinity with no MPI (always built).
* `affinity.mpi`: tests thread affinity with MPI (built by default).
* `affinity.cuda`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=cuda`).
* `affinity.rocm`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=rocm`).
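For a quick sanity check without MPI, the OpenMP-only binary can be run directly (a sketch; the reported cores depend on where it is run):

```console title="Running affinity.omp directly"
$ OMP_NUM_THREADS=4 ./affinity.omp
```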

??? example "Testing CPU affinity"
Test CPU affinity (this can be used on both CPU and GPU enabled nodes).
```console
$ uenv start prgenv-gnu/24.11:v2 --view=default
$ srun -n8 -N2 -c72 ./affinity.mpi
affinity test for 8 MPI ranks
rank 0 @ nid006363: threads [ 0:71] -> cores [ 0: 71]
rank 1 @ nid006363: threads [ 0:71] -> cores [ 72:143]
rank 2 @ nid006363: threads [ 0:71] -> cores [144:215]
rank 3 @ nid006363: threads [ 0:71] -> cores [216:287]
rank 4 @ nid006375: threads [ 0:71] -> cores [ 0: 71]
rank 5 @ nid006375: threads [ 0:71] -> cores [ 72:143]
rank 6 @ nid006375: threads [ 0:71] -> cores [144:215]
rank 7 @ nid006375: threads [ 0:71] -> cores [216:287]
```

In this example there are 8 MPI ranks:

* ranks `0:3` are on node `nid006363`;
* ranks `4:7` are on node `nid006375`;
* each rank has 72 threads numbered `0:71`;
* all threads on each rank have affinity with the same 72 cores;
* each rank gets 72 cores, e.g. rank 1 gets cores `72:143` on node `nid006363`.



??? example "Testing GPU affinity"
Use `affinity.cuda` or `affinity.rocm` to test on GPU-enabled systems.

```console
$ srun -n4 -N1 ./affinity.cuda #(1)
GPU affinity test for 4 MPI ranks
rank 0 @ nid005555
cores : [0:7]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 1 @ nid005555
cores : [72:79]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 2 @ nid005555
cores : [144:151]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 3 @ nid005555
cores : [216:223]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
$ srun -n4 -N1 --gpus-per-task=1 ./affinity.cuda #(2)
GPU affinity test for 4 MPI ranks
rank 0 @ nid005675
cores : [0:7]
gpu 0 : GPU-a16a8dac-7661-a44b-c6f8-f783f6e812d3
rank 1 @ nid005675
cores : [72:79]
gpu 0 : GPU-ca5160ac-2c1e-ff6c-9cec-e7ce5c9b2d09
rank 2 @ nid005675
cores : [144:151]
gpu 0 : GPU-496a2216-8b3c-878e-e317-36e69af11161
rank 3 @ nid005675
cores : [216:223]
gpu 0 : GPU-766e3b8b-fa19-1480-b02f-0dfd3f2c87ff
```

1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.

2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank.

[](){#ref-slurm-gh200}
## NVIDIA GH200 GPU Nodes

@@ -54,9 +213,9 @@
The "default" mode is used to avoid issues with certain containers.
Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.

Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.

If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
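A quick way to check this without modifying the application is to print the relevant variables for each rank; a minimal sketch using only standard Slurm environment variables:

```console title="Printing GPU visibility per rank"
$ srun -N1 -n4 --gpus-per-task=1 bash -c 'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```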

@@ -76,7 +235,7 @@

srun <application>
```

Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.

[](){#ref-slurm-gh200-multi-rank-per-gpu}
@@ -144,7 +303,42 @@ The configuration that is optimal for your application may be different.
[NVIDIA's Multi-Process Service (MPS)]: https://docs.nvidia.com/deploy/mps/index.html

[](){#ref-slurm-amdcpu}
## AMD CPU Nodes

Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].

For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

The typical Slurm workload that we want to schedule distributes `NR` MPI ranks over nodes, with `NT` threads per rank.

Each node has 128 cores, so we can reasonably run a maximum of 128 MPI ranks per node.

Each node has 2 sockets, and each socket contains 4 NUMA nodes.

Each MPI rank is assigned a set of cores on a specific node -- to get the best performance you want to follow some best practices:

* do not spread the cores assigned to a single MPI rank across multiple sockets;
* it is often advantageous to use one rank per NUMA region, i.e. 8 ranks per node with 16 cores each -- the sweet spot is application specific.

The most commonly used flags for controlling this distribution are:

| flag | description |
| ---- | ----------- |
| `-N, --nodes` | number of nodes to allocate |
| `-n, --ntasks` | total number of MPI ranks |
| `--ntasks-per-node` | number of MPI ranks per node |
| `-c, --cpus-per-task` | number of CPUs allocated to each rank |
| `--hint=nomultithread` | use only one PU per physical core |

Here we assign 64 cores to each rank -- one rank per socket -- and observe that each rank reports 128 "cores", because SMT presents two PUs per physical core:
```console title="One MPI rank per socket"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 ./affinity.mpi
```

If you want each rank to use only the 64 physical cores (one PU per core), use the `--hint=nomultithread` option.
```console title="One MPI rank per socket, one PU per core"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 --hint=nomultithread ./affinity.mpi
```

```console title="One MPI rank per NUMA region"
$ OMP_NUM_THREADS=64 srun -n16 -N1 -c8 ./affinity.mpi
```

In the above examples the threads of each rank are free to run on any of the cores assigned to that rank -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance; however, it is sometimes beneficial to bind threads to explicit cores.

```console title="One MPI rank per NUMA region"
$ OMP_BIND_PROC=true OMP_NUM_THREADS=64 srun -n16 -N1 -c8 ./affinity.mpi
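Putting this together, a batch script for a hybrid MPI+OpenMP job with one rank per NUMA region might look like the following sketch (the account, time limit, and application name are placeholders):

```bash title="Example sbatch script: 2 nodes, 8 ranks per node, 16 cores per rank"
#!/bin/bash
#SBATCH --account=g123          # placeholder project
#SBATCH --job-name=hybrid-job
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one MPI rank per NUMA region
#SBATCH --cpus-per-task=16      # 16 cores per rank
#SBATCH --hint=nomultithread    # one PU per physical core

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=true

# pass the cpus-per-task value explicitly: recent Slurm versions do not
# propagate it from sbatch to srun automatically
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./my_application
```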
```