The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).
[](){#ref-slurm-partition-debug}
### Debug partition
The SLURM `debug` partition is useful for quick-turnaround workflows. The partition has a short maximum time limit (the time limit can be seen with `sinfo -p debug`) and a low maximum node count (the `MaxNodes` limit can be seen with `scontrol show partition=debug`).
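For example, the partition limits can be inspected, and a job submitted to the partition, as follows (a sketch: the actual limits vary between clusters, and `job.sh` is a placeholder for your batch script):

```bash
# show the availability, time limit and node counts of the debug partition
sinfo -p debug

# show all limits of the partition, including MaxNodes and MaxTime
scontrol show partition=debug

# submit a batch script to the debug partition
sbatch --partition=debug job.sh
```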
The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
## Affinity
The following sections will document how to use Slurm on different compute nodes available on Alps.
To demonstrate the effects of different Slurm parameters, we will use a small command-line tool, [affinity](https://github.com/bcumming/affinity), that prints the CPU cores and GPUs assigned to each MPI rank in a job, and the node that each rank runs on.
We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.
It is straightforward to build the affinity tool to experiment with Slurm configurations (a sketch of the build commands follows the list below).
1. Affinity can be built using [`prgenv-gnu`][ref-uenv-prgenv-gnu] on all clusters.
2. By default affinity will build with MPI support and no GPU support: configure with no additional arguments on a CPU-only system like [Eiger][ref-cluster-eiger].
3. Enable CUDA support on systems that provide NVIDIA GPUs.
4. Enable ROCM support on systems that provide AMD GPUs.
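A minimal sketch of the build, assuming the `prgenv-gnu` uenv is already loaded and that the project builds with a standard CMake workflow (the GPU flag is only needed on GPU systems):

```bash
git clone https://github.com/bcumming/affinity.git
cd affinity

# CPU-only systems (e.g. Eiger): configure with no additional arguments
cmake -B build
cmake --build build

# NVIDIA GPU systems: add -DAFFINITY_GPU=cuda
# AMD GPU systems:    add -DAFFINITY_GPU=rocm
cmake -B build -DAFFINITY_GPU=cuda
cmake --build build
```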
The build generates the following executables:
* `affinity.omp`: tests thread affinity with no MPI (always built).
* `affinity.mpi`: tests thread affinity with MPI (built by default).
* `affinity.cuda`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=cuda`).
* `affinity.rocm`: tests thread and GPU affinity with MPI (built with `-DAFFINITY_GPU=rocm`).
??? example "Testing CPU affinity"
    Test CPU affinity (this can be used on both CPU and GPU enabled nodes).
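    A minimal sketch of such a test, assuming `affinity.mpi` was built as described above; the node, rank, and core counts are illustrative:

    ```bash
    # 4 MPI ranks on one node, 4 cores per rank; each rank reports
    # the node it runs on and the cores it is bound to.
    srun -N1 -n4 --cpus-per-task=4 ./affinity.mpi
    ```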
??? example "Testing GPU affinity"
    1. Test GPU affinity: note how all 4 ranks see the same 4 GPUs.

    2. Test GPU affinity: note how the `--gpus-per-task=1` parameter assigns a unique GPU to each rank (see the command sketch below).
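    A minimal sketch of the two runs, assuming 4 ranks on a single node with 4 GPUs and that `affinity.cuda` was built as described above (use `affinity.rocm` on AMD GPU nodes):

    ```bash
    # without GPU binding options, every rank sees all four GPUs on the node
    srun -N1 -n4 ./affinity.cuda

    # with --gpus-per-task=1, each rank is assigned its own GPU
    srun -N1 -n4 --gpus-per-task=1 ./affinity.cuda
    ```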
[](){#ref-slurm-gh200}
## NVIDIA GH200 GPU Nodes
[NVIDIA's Multi-Process Service (MPS)]: https://docs.nvidia.com/deploy/mps/index.html
[](){#ref-slurm-amdcpu}
## AMD CPU Nodes
Alps has nodes with two AMD EPYC Rome CPU sockets for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].
!!! todo
    document how slurm is configured on AMD CPU nodes (e.g. eiger)
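A minimal batch-script sketch for a hybrid MPI+OpenMP job on a two-socket AMD Rome node; the task and core counts and the `--hint=nomultithread` binding are illustrative assumptions that should be verified with the affinity tool:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8       # 8 MPI ranks per node (illustrative)
#SBATCH --cpus-per-task=16        # 16 cores per rank (illustrative)
#SBATCH --hint=nomultithread      # assumption: one thread per physical core

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./affinity.mpi               # check the resulting binding
```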