51 changes: 40 additions & 11 deletions docs/alps/hardware.md
@@ -40,13 +40,13 @@

There are currently five node types in Alps:

| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
| ---- | ------- | ------:| -----:| -----------:| -----------:|
| [NVIDIA GH200][ref-alps-gh200-node] | gh200 | 1344 | 2688 | 10,752 | 10,752 |
| [AMD Rome][ref-alps-zen2-node] | zen2 | 256 | 1024 | 2,048 | -- |
| [NVIDIA A100][ref-alps-a100-node] | a100 | 72 | 144 | 144 | 576 |
| [AMD MI250x][ref-alps-mi200-node] | mi200 | 12 | 24 | 24 | 96 |
| [AMD MI300A][ref-alps-mi300-node] | mi300 | 64 | 128 | 512 | 512 |

[](){#ref-alps-gh200-node}
### NVIDIA GH200 GPU Nodes
@@ -80,16 +80,45 @@
[](){#ref-alps-zen2-node}
### AMD Rome CPU Nodes

These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, are packaged four nodes per HPE Cray EX425 blade, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:

* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs.
* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs.

!!! note "Not all memory is available"
The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard and large memory nodes respectively.

The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.
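If a job needs more memory than a standard node provides, one approach (a sketch -- whether this selects a large-memory node depends on how memory limits are configured in Slurm on the cluster) is to request the memory explicitly at submission, so that the job can only be placed on nodes with enough memory:

```console title="Requesting more memory than a standard node provides (sketch)"
$ sbatch --mem=400G job.sh
```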

A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
{.annotate}

1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.

![Screenshot](../images/slurm/eiger-topo.png)

* The two sockets are labelled Package L#0 and Package L#1.
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, presented as two processing units (PU) per physical core.

* The first PU on each core is numbered 0:63 on socket 0, and 64:127 on socket 1.
* The second PU on each core is numbered 128:191 on socket 0, and 192:255 on socket 1.
* Hence core `n` hosts PUs `n` and `n+128`.


Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.

[](){#ref-alps-a100-node}
### NVIDIA A100 GPU Nodes

The Grizzly Peak blades contain two nodes, where each node has:

* One 64-core Zen3 CPU socket
* 512 GB DDR4 Memory
* 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
* 4 NICs -- one per GPU.

The nodes in the MCH system are the same, except that their A100 GPUs have 96 GB of memory each.

[](){#ref-alps-mi200-node}
### AMD MI250x GPU Nodes
Binary file added docs/images/slurm/eiger-topo.png
218 changes: 206 additions & 12 deletions docs/running/slurm.md
@@ -4,17 +4,67 @@
CSCS uses [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster. It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.

## Accounting
## Accounts and resources

Slurm associates each job with a CSCS project in order to perform accounting.
The project to use for accounting is specified using the `--account/-A` flag.
If no account is specified, the primary project is used as the default.

??? example "Which projects am I a member of?"
Users are often members of multiple projects, and by extension of their associated `group_id` groups.
You can get a list of your groups using the `id` command in the terminal:
```console
$ id $USER
uid=12345(bobsmith) gid=32819(g152) groups=32819(g152),33119(g174),32336(vasp6)
```
Here the user `bobsmith` is in three projects (`g152`, `g174` and `vasp6`), with the project `g152` being their **primary project**.

??? example "What is my primary project?"
In the terminal, use the following command to find your **primary group**:
```console
$ id -gn $USER
g152
```

```console title="Specifying the account on the command line"
$ srun -A g123 -n4 -N1 ./run
$ srun --account=g123 -n4 -N1 ./run
$ sbatch --account=g123 ./job.sh
```

```bash title="Specifying the account in an sbatch script"
#!/bin/bash

#SBATCH --account=g123
#SBATCH --job-name=example-%j
#SBATCH --time=00:30:00
#SBATCH --nodes=4
...
```
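The account can also be set using Slurm's standard input environment variables, so that it does not have to be repeated for every submission. A minimal sketch, assuming the project `g123` (replace with your own project):

```bash title="Setting a default account via environment variables"
# for example in ~/.bashrc
export SBATCH_ACCOUNT=g123   # used by sbatch
export SLURM_ACCOUNT=g123    # used by srun
export SALLOC_ACCOUNT=g123   # used by salloc
```

Command-line flags take precedence over these variables.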

!!! note
The `--account` and `-Cmc` flags that were required on the old Eiger cluster are no longer needed.

## Prioritization and scheduling

Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
An aging factor is also applied to each job in the queue to ensure fairness over time.

Since users from various projects are continuously submitting jobs, the relative priority of jobs is dynamic and may change frequently.
As a result, estimated start times are approximate and subject to change based on new job submissions.

Additionally, short-duration jobs may be selected for backfilling — a process where the scheduler fills in available time slots while preparing to run a larger, higher-priority job.
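Standard Slurm commands can be used to see where your jobs stand in the queue; for example (a sketch -- the exact output depends on the Slurm version and configuration):

```console title="Inspecting estimated start times and job priority"
$ squeue --me --start   # estimated start times of your pending jobs
$ sprio -u $USER        # breakdown of the priority factors of your pending jobs
```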

[](){#ref-slurm-partitions}
## Partitions

At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.

Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs.
For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.

!!! example "How to check the partitions and number of nodes therein?"
You can check the size of the system by running the following command in the terminal:
@@ -27,7 +77,6 @@
```
The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).


[](){#ref-slurm-partition-debug}
### Debug partition
The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (the time limit can be seen with `sinfo -p debug`) and a low maximum node count (the `MaxNodes` value can be seen with `scontrol show partition=debug`).
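For example, a small test can be sent to the debug partition by selecting the partition explicitly (here `job.sh` and `./test.exe` are placeholders for your own script and application):

```console title="Submitting to the debug partition"
$ srun --partition=debug -N1 -n4 -t 00:10:00 ./test.exe
$ sbatch --partition=debug job.sh
```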
@@ -38,6 +87,116 @@

The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.

## Affinity

The following sections document how to use Slurm on the different compute node types available on Alps.
To demonstrate the effect of different Slurm parameters, we use a small command line tool, [affinity](https://github.com/bcumming/affinity), that prints the CPU cores and GPUs assigned to each MPI rank in a job, and the node on which each rank runs.

We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.

It is straightforward to build the affinity tool to experiment with Slurm configurations.

```console title="Compiling affinity"
$ uenv start prgenv-gnu/24.11:v2 --view=default #(1)
$ git clone https://github.com/bcumming/affinity.git
$ cd affinity; mkdir build; cd build;
$ CC=gcc CXX=g++ cmake .. #(2)
$ CC=gcc CXX=g++ cmake .. -DAFFINITY_GPU=cuda #(3)
$ CC=gcc CXX=g++ cmake .. -DAFFINITY_GPU=rocm #(4)
```

1. Affinity can be built using [`prgenv-gnu`][ref-uenv-prgenv-gnu] on all clusters.

2. By default affinity will build with MPI support and no GPU support: configure with no additional arguments on a CPU-only system like [Eiger][ref-cluster-eiger].

3. Enable CUDA support on systems that provide NVIDIA GPUs.

4. Enable ROCM support on systems that provide AMD GPUs.

The build generates the following executables:

* `affinity.omp`: tests thread affinity with no MPI (always built).
* `affinity.mpi`: tests thread affinity with MPI (built by default).
* `affinity.cuda`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=cuda`).
* `affinity.rocm`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=rocm`).
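For a quick sanity check without MPI, the OpenMP-only binary can be run directly (a sketch; the reported cores depend on where it is run):

```console title="Running affinity.omp directly"
$ OMP_NUM_THREADS=4 ./affinity.omp
```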

??? example "Testing CPU affinity"
Test CPU affinity (this can be used on both CPU and GPU enabled nodes).
```console
$ uenv start prgenv-gnu/24.11:v2 --view=default
$ srun -n8 -N2 -c72 ./affinity.mpi
affinity test for 8 MPI ranks
rank 0 @ nid006363: threads [ 0:71] -> cores [ 0: 71]
rank 1 @ nid006363: threads [ 0:71] -> cores [ 72:143]
rank 2 @ nid006363: threads [ 0:71] -> cores [144:215]
rank 3 @ nid006363: threads [ 0:71] -> cores [216:287]
rank 4 @ nid006375: threads [ 0:71] -> cores [ 0: 71]
rank 5 @ nid006375: threads [ 0:71] -> cores [ 72:143]
rank 6 @ nid006375: threads [ 0:71] -> cores [144:215]
rank 7 @ nid006375: threads [ 0:71] -> cores [216:287]
```

In this example there are 8 MPI ranks:

* ranks `0:3` are on node `nid006363`;
* ranks `4:7` are on node `nid006375`;
* each rank has 72 threads numbered `0:71`;
* all threads on each rank have affinity with the same 72 cores;
* each rank gets 72 cores, e.g. rank 1 gets cores `72:143` on node `nid006363`.



??? example "Testing GPU affinity"
Use `affinity.cuda` or `affinity.rocm` to test on GPU-enabled systems.

```console
$ srun -n4 -N1 ./affinity.cuda #(1)
GPU affinity test for 4 MPI ranks
rank 0 @ nid005555
cores : [0:7]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 1 @ nid005555
cores : [72:79]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 2 @ nid005555
cores : [144:151]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
rank 3 @ nid005555
cores : [216:223]
gpu 0 : GPU-2ae325c4-b542-26c2-d10f-c4d84847f461
gpu 1 : GPU-5923dec6-288f-4418-f485-666b93f5f244
gpu 2 : GPU-170b8198-a3e1-de6a-ff82-d440f71c05da
gpu 3 : GPU-0e184efb-1d1f-f278-b96d-15bc8e5f17be
$ srun -n4 -N1 --gpus-per-task=1 ./affinity.cuda #(2)
GPU affinity test for 4 MPI ranks
rank 0 @ nid005675
cores : [0:7]
gpu 0 : GPU-a16a8dac-7661-a44b-c6f8-f783f6e812d3
rank 1 @ nid005675
cores : [72:79]
gpu 0 : GPU-ca5160ac-2c1e-ff6c-9cec-e7ce5c9b2d09
rank 2 @ nid005675
cores : [144:151]
gpu 0 : GPU-496a2216-8b3c-878e-e317-36e69af11161
rank 3 @ nid005675
cores : [216:223]
gpu 0 : GPU-766e3b8b-fa19-1480-b02f-0dfd3f2c87ff
```

1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.

2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank.

[](){#ref-slurm-gh200}
## NVIDIA GH200 GPU Nodes

@@ -54,9 +213,9 @@
The "default" mode is used to avoid issues with certain containers.
Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
This also means that different ranks on the same node can inadvertently use the same GPU leading to suboptimal performance or unused GPUs, rather than job failures.

Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.

If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
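A quick way to check this without modifying the application is to print the relevant variables for each rank; a minimal sketch using only standard Slurm environment variables:

```console title="Printing GPU visibility per rank"
$ srun -N1 -n4 --gpus-per-task=1 bash -c 'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```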

@@ -76,7 +235,7 @@

srun <application>
```

Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.

[](){#ref-slurm-gh200-multi-rank-per-gpu}
@@ -144,7 +303,42 @@ The configuration that is optimal for your application may be different.
[NVIDIA's Multi-Process Service (MPS)]: https://docs.nvidia.com/deploy/mps/index.html

[](){#ref-slurm-amdcpu}
## AMD CPU Nodes

Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].

For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

The typical Slurm workload that we want to schedule distributes `NR` MPI ranks over nodes, with `NT` threads per rank.

Each node has 128 cores, so we can reasonably run a maximum of 128 MPI ranks per node.

Each node has 2 sockets, and each socket contains 4 NUMA nodes.

Each MPI rank is assigned a set of cores on a specific node -- to get the best performance you want to follow some best practices:

* do not spread the cores assigned to a single MPI rank across multiple sockets;
* it is often advantageous to use one rank per NUMA region, i.e. 8 ranks per node with 16 cores each -- the sweet spot is application specific.

The most commonly used flags for controlling this distribution are:

| flag | description |
| ---- | ----------- |
| `-N, --nodes` | number of nodes to allocate |
| `-n, --ntasks` | total number of MPI ranks |
| `--ntasks-per-node` | number of MPI ranks per node |
| `-c, --cpus-per-task` | number of CPUs allocated to each rank |
| `--hint=nomultithread` | use only one PU per physical core |

Here we assign 64 cores to each rank -- one rank per socket -- and observe that each rank reports 128 "cores", because SMT presents two PUs per physical core:
```console title="One MPI rank per socket"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 ./affinity.mpi
```

If you want each rank to use only the 64 physical cores (one PU per core), use the `--hint=nomultithread` option.
```console title="One MPI rank per socket, one PU per core"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 --hint=nomultithread ./affinity.mpi
```

```console title="One MPI rank per NUMA region"
$ OMP_NUM_THREADS=64 srun -n16 -N1 -c8 ./affinity.mpi
```

In the above examples the threads of each rank are free to run on any of the cores assigned to that rank -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance; however, it is sometimes beneficial to bind threads to explicit cores.

```console title="One MPI rank per NUMA region"
$ OMP_BIND_PROC=true OMP_NUM_THREADS=64 srun -n16 -N1 -c8 ./affinity.mpi
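Putting this together, a batch script for a hybrid MPI+OpenMP job with one rank per NUMA region might look like the following sketch (the account, time limit, and application name are placeholders):

```bash title="Example sbatch script: 2 nodes, 8 ranks per node, 16 cores per rank"
#!/bin/bash
#SBATCH --account=g123          # placeholder project
#SBATCH --job-name=hybrid-job
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one MPI rank per NUMA region
#SBATCH --cpus-per-task=16      # 16 cores per rank
#SBATCH --hint=nomultithread    # one PU per physical core

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=true

# pass the cpus-per-task value explicitly: recent Slurm versions do not
# propagate it from sbatch to srun automatically
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./my_application
```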
```