Commit fd50309

clariden bump (#28)
1 parent 352a5f8 commit fd50309

2 files changed: +17 -4 lines


docs/tools/slurm.md

Lines changed: 12 additions & 3 deletions
@@ -9,16 +9,25 @@ SLURM is an open-source, highly scalable job scheduler that allocates computing
 !!! todo
     document `--account`, `--constraint` and other generic flags.
 
+[](){#ref-slurm-partitions}
 ## Partitions
 
 At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters. These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs. Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
 
 Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs. For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently. SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
 
+[](){#ref-slurm-partition-debug}
+### Debug partition
+The SLURM `debug` partition is useful for quick-turnaround workflows. The partition has a short maximum time (the time limit can be seen with `sinfo -p debug`) and a low maximum node count (the `MaxNodes` value can be seen with `scontrol show partition=debug`).
+
+[](){#ref-slurm-partition-normal}
+### Normal partition
+This is the default partition, and is used when you do not explicitly set a partition. It is the correct choice for standard jobs. The maximum time is usually set to 24 hours (see `sinfo -p normal` for the time limit), and the maximum number of nodes can be as large as the number of available nodes.
+
 The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
 
 [](){#ref-slurm-gh200}
-### NVIDIA GH200 GPU Nodes
+## NVIDIA GH200 GPU Nodes
 
 The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
 Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
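
As a rough illustration of the partition text added above (a sketch only: the partition names come from the diff, while the `sinfo` output columns and the actual limits depend on the cluster's SLURM configuration):

```bash
# Show partition, time limit, and node count for the debug and normal partitions.
sinfo -p debug  -o "%P %l %D"
sinfo -p normal -o "%P %l %D"

# Show the full partition definition, including MaxNodes and MaxTime.
scontrol show partition=debug
```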
@@ -40,7 +49,7 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
 If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
 
 [](){#ref-slurm-gh200-single-rank-per-gpu}
-#### One rank per GPU
+### One rank per GPU
 
 Configuring SLURM to use one GH200 GPU per rank is most easily done with the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
 For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
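
A minimal job-script sketch built from the flags named above, assuming a four-GPU GH200 node; `<application>` is a placeholder and account/partition directives are omitted:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # one rank per GPU on a four-GPU GH200 node
#SBATCH --gpus-per-task=1     # each rank is assigned exactly one GPU

srun <application>
```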
@@ -59,7 +68,7 @@ srun <application>
 Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.
 
 [](){#ref-slurm-gh200-multi-rank-per-gpu}
-#### Multiple ranks per GPU
+### Multiple ranks per GPU
 
 Using multiple ranks per GPU can improve the performance of applications that do not generate enough work for a GPU with a single rank, or that scale badly across all 72 cores of the Grace CPU.
 In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
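
One hypothetical way to set this up (not necessarily what the documentation goes on to recommend) is to request more ranks than GPUs and derive `CUDA_VISIBLE_DEVICES` from `SLURM_LOCALID`, so that consecutive ranks share a GPU:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16   # 16 ranks per node, i.e. 4 ranks per GPU

# Ranks 0-3 use GPU 0, ranks 4-7 use GPU 1, and so on; <application> is a placeholder.
srun bash -c 'export CUDA_VISIBLE_DEVICES=$((SLURM_LOCALID / 4)); exec <application>'
```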

docs/vclusters/clariden.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,11 @@
 
 This page is a cut and paste of some of Todi's old documentation, which we can turn into a template.
 
-## Cluster Details
+## Cluster Specification
+### Hardware
+Clariden consists of ~1200 [Grace-Hopper nodes][ref-alps-gh200-node]. Most nodes are in the [`normal` SLURM partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
+
+
 
 !!! todo
     a standardised table with information about
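
To see how Clariden's nodes are distributed across these partitions, a summary query on a login node should suffice (the output layout varies with the SLURM version):

```bash
# One line per partition with aggregate node counts (allocated/idle/other/total).
sinfo --summarize
```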
