docs/alps/hardware.md: 32 additions & 6 deletions
@@ -80,19 +80,45 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
[](){#ref-alps-zen2-node}
### AMD Rome CPU Nodes
- !!! todo
-     [confluence link 1](https://confluence.cscs.ch/spaces/KB/pages/850199545/Compute+node+configuration)
+ These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:

+ * **Standard memory**: 256 GB in 16x16 GB DDR4 DIMMs.
+ * **Large memory**: 512 GB in 16x32 GB DDR4 DIMMs.

+ !!! note "Not all memory is available"
+     The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard and large memory nodes respectively.

+     The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.
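If a job needs most of a node's memory, it can help to state the requirement explicitly. The sketch below uses the generic Slurm `--mem` (memory per node) option; whether this is the recommended way to target the large-memory nodes on a given vCluster is an assumption, so check the cluster documentation.

```console
$ # Request all of the memory available on each allocated node.
$ sbatch --mem=0 job.sh
$ # Or request an explicit amount per node, which standard-memory nodes cannot satisfy.
$ sbatch --mem=450G job.sh
```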

+ A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
+ {.annotate}

+ 1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.

-     [confluence link 2](https://confluence.cscs.ch/spaces/KB/pages/850199543/CPU+configuration)
+ 

-     EX425
+ * The two sockets are labelled Package L#0 and Package L#1.
+ * Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

+ Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, which are presented as two processing units (PUs) per physical core.

+ * The first PUs on each core are numbered 0:63 on socket 0, and 64:127 on socket 1;
+ * The second PUs on each core are numbered 128:191 on socket 0, and 192:255 on socket 1.
+ * Hence PU `n` and PU `n+128` are the two SMT threads of physical core `n` (this can be checked on a node, as shown below).
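A quick way to confirm the pairing on a compute node is to read the SMT sibling list from sysfs. The sketch below assumes an interactive allocation on an Eiger node, and the output shown is what the numbering above implies rather than a captured transcript.

```console
$ # PUs sharing physical core 0; given the numbering above the expected pair is 0,128.
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,128
```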

+ Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.
[](){#ref-alps-a100-node}
### NVIDIA A100 GPU Nodes
- !!! todo
+ The Grizzly Peak blades contain two nodes, where each node has:

-     Grizzly Peak
+ * One 64-core Zen3 CPU socket
+ * 512 GB DDR4 memory
+ * 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
+ * The MCH system is the same, except that its A100 GPUs have 96 GB of memory each.
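To verify the GPU model and memory from inside a job, a generic sketch (the account is a placeholder, and `--gpus-per-node` assumes the GPUs are exposed as a Slurm GRES; the exact partition or constraint for A100 nodes is cluster specific):

```console
$ srun -A g123 --nodes=1 --gpus-per-node=4 nvidia-smi --query-gpu=name,memory.total --format=csv
```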
docs/running/slurm.md: 93 additions & 11 deletions
@@ -4,19 +4,67 @@
CSCS uses [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster. It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.

- ## Accounting
+ ## Accounts and resources

- !!! todo
-     document `--account`, `--constraint` and other generic flags.
+ Slurm associates each job with a CSCS project in order to perform accounting.
+ The project to use for accounting is specified using the `--account/-A` flag.
+ If no account is specified, the user's primary project is used as the default.
+ Here the user `bobsmith` is in three projects (`g152`, `g174` and `vasp6`), with the project `g152` being their **primary project**.

+ ??? example "What is my primary project?"
+     In the terminal, use the following command to find your **primary group**:
+     ```console
+     $ id -gn $USER
+     g152
+     ```

+ ```console title="Specifying the account on the command line"
+ srun -A g123 -n4 -N1 ./run
+ srun --account=g123 -n4 -N1 ./run
+ sbatch --account=g123 ./job.sh
+ ```

+ ```bash title="Specifying the account in an sbatch script"
+ #!/bin/bash

+ #SBATCH --account=g123
+ #SBATCH --job-name=example-%j
+ #SBATCH --time=00:30:00
+ #SBATCH --nodes=4
+ ...
+ ```

+ !!! note
+     The `--account` and `-Cmc` flags that were required on the old Eiger cluster are no longer needed.

+ ## Prioritization and scheduling

+ Job priorities are determined based on each project's resource usage relative to its quarterly allocation, as well as in comparison to other projects.
+ An aging factor is also applied to each job in the queue to ensure fairness over time.

+ Since users from various projects are continuously submitting jobs, the relative priority of jobs is dynamic and may change frequently.
+ As a result, estimated start times are approximate and subject to change based on new job submissions.

+ Additionally, short-duration jobs may be selected for backfilling, a process where the scheduler fills in available time slots while preparing to run a larger, higher-priority job.
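Standard Slurm commands give a rough, non-binding view of this. The sketch below is generic Slurm usage rather than CSCS-specific guidance.

```console
$ # Slurm's current (approximate) estimate of when your pending jobs will start.
$ squeue -u $USER --start
$ # The factors (fair-share, age, job size, ...) contributing to each pending job's priority.
$ sprio -u $USER
```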
[](){#ref-slurm-partitions}
## Partitions

- At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters. These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs. Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
+ At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
+ These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
+ Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.

- Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs. For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently. SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
+ Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs.
+ For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
+ SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.

!!! example "How to check the partitions and number of nodes therein?"
    You can check the size of the system by running the following command in the terminal:
@@ -165,9 +213,9 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
The "default" mode is used to avoid issues with certain containers.
Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs, rather than job failures.

Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.

If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty, all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
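A minimal way to do this is to have each rank echo the variables before the application starts; the resource flags below are placeholders for whatever the job already uses.

```console
$ srun --ntasks-per-node=4 bash -c 'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```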
@@ -187,7 +235,7 @@ The examples below launch jobs on two nodes with four ranks per node using `sbat
srun <application>
```

Omitting the `--gpus-per-task` flag results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.
[](){#ref-slurm-gh200-multi-rank-per-gpu}
@@ -258,5 +306,39 @@ The configuration that is optimal for your application may be different.
## AMD CPU Nodes
Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].
- !!! todo
-     document how slurm is configured on AMD CPU nodes (e.g. eiger)

+ For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

+ The typical Slurm workload that we want to schedule will distribute `NR` MPI ranks over the nodes, with `NT` threads per rank.

+ Each node has 128 cores, so we can reasonably expect to run a maximum of 128 MPI ranks per node.

+ Each node has 2 sockets, and each socket contains 4 NUMA nodes.

+ Each MPI rank is assigned a set of cores on a specific node. To get the best performance, follow some best practices:

+ * don't spread the cores of a single MPI rank across multiple sockets;
+ * it might be advantageous to run 8 ranks per node with 16 cores each, so that each rank maps onto one NUMA node - the sweet spot is application specific (see the sketch below).

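A minimal sketch of an sbatch script for the 8-ranks-per-node layout mentioned above; the account name, node count, time limit and binding flags are illustrative assumptions rather than CSCS-prescribed settings.

```bash
#!/bin/bash
#SBATCH --account=g123          # placeholder project
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8     # one MPI rank per NUMA node
#SBATCH --cpus-per-task=16      # 16 cores per rank
#SBATCH --hint=nomultithread    # one thread per physical core
#SBATCH --time=00:30:00

# One OpenMP thread per allocated core.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun --cpu-bind=cores <application>
```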
+ !!! todo "table of basic flags: nodes, cores-per-task, etc."

+ Here we assign 64 cores to each rank, and observe that each rank sees 128 "cpus" (2 PUs per physical core):
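A sketch of what such a launch might look like; the flags are illustrative and the printed CPU counts are what the SMT numbering above implies, not a captured transcript.

```console
$ # Two ranks per node, one socket (64 physical cores) per rank.
$ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=64 bash -c 'echo "rank $SLURM_PROCID sees $(nproc) cpus"'
rank 0 sees 128 cpus
rank 1 sees 128 cpus
```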
0 commit comments