
Commit ac563e9

text formatting; SLURM -> Slurm
1 parent 8f56190 commit ac563e9


2 files changed: +21 −16 lines


docs/running/slurm.md

Lines changed: 20 additions & 15 deletions
@@ -1,7 +1,7 @@
 [](){#ref-slurm}
-# SLURM
+# Slurm
 
-CSCS uses the [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
+CSCS uses the [Slurm](https://slurm.schedmd.com/documentation.html) workload manager to efficiently schedule and manage jobs on Alps vClusters.
 SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
 It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.
 
@@ -33,9 +33,13 @@ It supports advanced scheduling policies, job dependencies, resource reservation
 [](){#ref-slurm-partitions}
 ## Partitions
 
-At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters. These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs. Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
+At CSCS, Slurm is configured to accommodate the diverse range of node types available in our HPC clusters.
+These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
+Because of this heterogeneity, Slurm must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
 
-Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs. For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently. SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
+Each type of node has different resource constraints and capabilities, which Slurm takes into account when scheduling jobs.
+For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
+Slurm ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
 
 !!! example "How to check the partitions and number of nodes therein?"
     You can check the size of the system by running the following command in the terminal:
@@ -51,24 +55,26 @@ Each type of node has different resource constraints and capabilities, which SLU
 
 [](){#ref-slurm-partition-debug}
 ### Debug partition
-The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`).
+The Slurm `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`).
 
 [](){#ref-slurm-partition-normal}
 ### Normal partition
-This is the default partition, and will be used when you do not explicitly set a partition. This is the correct choice for standard jobs. The maximum time is usually set to 24 hours (`sinfo -p normal` for timelimit), and the maximum nodes can be as much as nodes are available.
+This is the default partition, and will be used when you do not explicitly set a partition.
+This is the correct choice for standard jobs. The maximum time is usually set to 24 hours (`sinfo -p normal` for timelimit), and the maximum nodes can be as much as nodes are available.
 
-The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
+The following sections will provide detailed guidance on how to use Slurm to request and manage CPU cores, memory, and GPUs in jobs.
+These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
 
 [](){#ref-slurm-gh200}
 ## NVIDIA GH200 GPU Nodes
 
-The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
+The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and Slurm job submissions must be configured appropriately to best make use of the resources.
 Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
-[Configuring SLURM jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
+[Configuring Slurm jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
 Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.
 
-The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
-See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific SLURM configurations.
+The best Slurm configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
+See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific Slurm configurations.
 
 !!! warning
     The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
@@ -84,7 +90,7 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
 [](){#ref-slurm-gh200-single-rank-per-gpu}
 ### One rank per GPU
 
-Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
+Configuring Slurm to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` Slurm flags.
 For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
 The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
 
@@ -104,7 +110,7 @@ Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, wh
 ### Multiple ranks per GPU
 
 Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
-In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
+In these cases Slurm jobs must be configured to assign multiple ranks to a single GPU.
 This is best done using [NVIDIA's Multi-Process Service (MPS)].
 To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
 
@@ -167,8 +173,7 @@ The configuration that is optimal for your application may be different.
 [](){#ref-slurm-amdcpu}
 ## AMD CPU
 
-!!! todo
-    document how slurm is configured on AMD CPU nodes (e.g. eiger)
+!!! todo "document how Slurm is configured on AMD CPU nodes (e.g. [eiger][ref-cluster-eiger])"
 
 [](){#ref-slurm-over-subscription}
 ## Node over-subscription
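
As a quick illustration of the partition-inspection commands referenced in the updated text (`sinfo -p debug` and `scontrol show partition=debug`), a login-node session might look like the sketch below; the limits reported depend on the vCluster, and the `grep` filter is only a convenience.

```bash
# Show the debug partition, including its time limit and node states.
sinfo -p debug

# Show the partition's configured limits, e.g. MaxNodes and MaxTime.
scontrol show partition=debug | grep -E 'MaxNodes|MaxTime'
```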
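The one-rank-per-GPU examples that the updated text refers to are not part of this diff. A minimal `sbatch` sketch along the lines it describes (two GH200 nodes, four ranks per node, one GPU per rank) might look like the following; the account, time limit, and application binary are placeholders.

```bash
#!/bin/bash
#SBATCH --nodes=2              # two GH200 nodes
#SBATCH --ntasks-per-node=4    # one rank per GPU (four GPUs per node)
#SBATCH --gpus-per-task=1      # assign one GPU to each rank
#SBATCH --time=00:30:00        # placeholder time limit
#SBATCH --account=<project>    # placeholder account

# Launch the application with one rank per GPU; ./my_app is a placeholder.
srun ./my_app
```

With this layout, `--gpus-per-task=1` behaves as the updated text describes: each rank sees a single device, equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`.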

mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ nav:
     - 'JupyterLab': services/jupyterlab.md
   - 'Running Jobs':
     - running/index.md
-    - 'slurm': running/slurm.md
+    - 'Slurm': running/slurm.md
     - 'Job report': running/jobreport.md
   - 'Data Management and Storage':
     - storage/index.md
