
Commit 8cd17b2

committed
wip
1 parent 636d4cd commit 8cd17b2

File tree: 3 files changed, +125 -17 lines changed

docs/alps/hardware.md

Lines changed: 32 additions & 6 deletions
@@ -80,19 +80,45 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
[](){#ref-alps-zen2-node}
### AMD Rome CPU Nodes

These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:

* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs.
* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs.

!!! note "Not all memory is available"
    The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard- and large-memory nodes respectively.

    The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.

A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
{.annotate}

1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.

![Screenshot](../images/slurm/eiger-topo.png)

* The two sockets are labelled Package L#0 and Package L#1.
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, which are presented as two PUs per physical core.

* The first PU on each core is numbered 0:63 on socket 0, and 64:127 on socket 1;
* The second PU on each core is numbered 128:191 on socket 0, and 192:255 on socket 1.
* Hence core `n` hosts the two SMT PUs `n` and `n+128`.

Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated in the diagram.

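If you want to verify this layout yourself, it can be queried directly on a compute node, for example from an interactive job. A minimal sketch, assuming `lscpu` and `numactl` are available in the default environment on the node:

```console title="Querying the core and NUMA layout on a node"
$ lscpu --extended=CPU,NODE,SOCKET,CORE   # one line per PU, with its NUMA node, socket and core
$ numactl --hardware                      # NUMA node sizes and distances
```
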
[](){#ref-alps-a100-node}
### NVIDIA A100 GPU Nodes

The Grizzly Peak blades contain two nodes, where each node has:

* One 64-core Zen3 CPU socket
* 512 GB DDR4 memory
* 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
* 4 NICs -- one per GPU.

The MCH system is the same, except that its A100 GPUs have 96 GB of memory each.

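To confirm what one of these nodes provides, you can list the GPUs from a job allocated on it. A minimal sketch, where the account is a placeholder and `nvidia-smi` is assumed to be on the default path of the GPU nodes:

```console title="Listing the GPUs visible on a node"
$ srun -A <account> -n1 nvidia-smi -L
```
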
[](){#ref-alps-mi200-node}
### AMD MI250x GPU Nodes

docs/images/slurm/eiger-topo.png

52.1 KB

docs/running/slurm.md

Lines changed: 93 additions & 11 deletions
@@ -4,19 +4,67 @@
CSCS uses [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster. It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.

## Accounts and resources

Slurm associates each job with a CSCS project in order to perform accounting.
The project to use for accounting is specified using the `--account/-A` flag.
If no account is specified, the primary project is used as the default.

??? example "Which projects am I a member of?"
    Users are often part of multiple projects, and by extension their associated `group_id` groups.
    You can get a list of your groups using the `id` command in the terminal:
    ```console
    $ id $USER
    uid=12345(bobsmith) gid=32819(g152) groups=32819(g152),33119(g174),32336(vasp6)
    ```
    Here the user `bobsmith` is in three projects (`g152`, `g174` and `vasp6`), with the project `g152` being their **primary project**.

??? example "What is my primary project?"
    In the terminal, use the following command to find your **primary group**:
    ```console
    $ id -gn $USER
    g152
    ```

```console title="Specifying the account on the command line"
srun -A g123 -n4 -N1 ./run
srun --account=g123 -n4 -N1 ./run
sbatch --account=g123 ./job.sh
```

```bash title="Specifying the account in an sbatch script"
#!/bin/bash

#SBATCH --account=g123
#SBATCH --job-name=example-%j
#SBATCH --time=00:30:00
#SBATCH --nodes=4
...
```

!!! note
    The `--account` flag and the `-Cmc` constraint that were required on the old Eiger cluster are no longer required.

## Prioritization and scheduling

Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
An aging factor is also applied to each job in the queue to ensure fairness over time.

Since users from various projects are continuously submitting jobs, the relative priority of jobs is dynamic and may change frequently.
As a result, estimated start times are approximate and subject to change based on new job submissions.

Additionally, short-duration jobs may be selected for backfilling, a process where the scheduler fills in available time slots while preparing to run a larger, higher-priority job.

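To get an idea of where your jobs stand in the queue, you can inspect the priority components and estimated start times that Slurm reports. A minimal sketch; the exact factors and weights shown depend on how the cluster is configured:

```console title="Inspecting job priority and estimated start times"
$ sprio -u $USER          # priority components (age, fair-share, ...) of your pending jobs
$ squeue --me --start     # estimated start times, which may change as new jobs are submitted
```
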
[](){#ref-slurm-partitions}
## Partitions

At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.

Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs.
For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.

!!! example "How to check the partitions and number of nodes therein?"
    You can check the size of the system by running the following command in the terminal:
@@ -165,9 +213,9 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
The "default" mode is used to avoid issues with certain containers.
Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs, rather than job failures.

Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.

If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.

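For example, a quick way to check the mapping without modifying your application is to launch a small shell command with the same resource flags as your job. The flags below are illustrative and should be replaced with your own:

```console title="Printing the GPU visible to each rank"
$ srun -N2 --ntasks-per-node=4 --gpus-per-task=1 bash -c \
    'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```
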
@@ -187,7 +235,7 @@ The examples below launch jobs on two nodes with four ranks per node using `sbat

srun <application>
```

Omitting the `--gpus-per-task` flag results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.

[](){#ref-slurm-gh200-multi-rank-per-gpu}
@@ -258,5 +306,39 @@ The configuration that is optimal for your application may be different.
## AMD CPU Nodes

Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].

For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

The typical Slurm workload that we want to schedule distributes `NR` MPI ranks over the nodes, with `NT` threads per rank.

Each node has 128 cores, so we can reasonably expect to run a maximum of 128 MPI ranks per node.

Each node has 2 sockets, and each socket contains 4 NUMA nodes.

Each MPI rank is assigned a set of cores on a specific node. To get the best performance, follow some best practices:

* don't spread the cores assigned to an MPI rank across multiple sockets;
* it can be advantageous to have 8 ranks per node, with 16 cores each -- the sweet spot is application specific.

!!! todo "table of basic flags: nodes, cores-per-task, etc."

Here we assign 64 cores to each rank (one rank per socket), and observe that each rank is given 128 "cores" -- two PUs per physical core:
```console title="One MPI rank per socket"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 ./affinity.mpi
```

If you want only the 64 physical cores, consider the `--hint=nomultithread` option.
```console title="One MPI rank per socket, without SMT"
$ OMP_NUM_THREADS=64 srun -n2 -N1 -c64 --hint=nomultithread ./affinity.mpi
```

```console title="One MPI rank per NUMA region"
$ OMP_NUM_THREADS=16 srun -n8 -N1 -c16 ./affinity.mpi
```

In the above examples the threads of each rank are free to run on any of that rank's cores -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance; however, it is sometimes beneficial to bind threads to explicit cores.

```console title="One MPI rank per NUMA region, with thread binding"
$ OMP_PROC_BIND=true OMP_NUM_THREADS=16 srun -n8 -N1 -c16 ./affinity.mpi
```

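Putting the above together, a minimal sketch of a batch script for a hybrid MPI+OpenMP job on these nodes follows. The account, job name, time limit and executable are placeholders, and the rank/thread counts follow the one-rank-per-NUMA-region example above:

```bash title="Example sbatch script: 8 ranks per node, 16 cores per rank"
#!/bin/bash
#SBATCH --account=g123          # placeholder project
#SBATCH --job-name=affinity-test
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one rank per NUMA region
#SBATCH --cpus-per-task=16      # 16 cores per rank
#SBATCH --hint=nomultithread    # use only the physical cores

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PROC_BIND=true

srun ./affinity.mpi
```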