4 changes: 2 additions & 2 deletions docs/clusters/bristen.md
@@ -12,7 +12,7 @@ Bristen consists of 32 A100 nodes [NVIDIA A100 nodes][ref-alps-a100-node]. The n
|-----------|--------| ----------------- | ---------- |
| [a100][ref-alps-a100-node] | 32 | 32 | 128 |

-Nodes are in the [`normal` slurm partition][ref-slurm-partition-normal].
+Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].

### Storage and file systems

@@ -48,7 +48,7 @@ Users are encouraged to use containers on Bristen.

Bristen uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

-There is currently a single slurm partition on the system:
+There is currently a single Slurm partition on the system:

* the `normal` partition is for all production workloads.
+ nodes in this partition are not shared.
4 changes: 2 additions & 2 deletions docs/clusters/clariden.md
@@ -14,7 +14,7 @@ The number of nodes can change when nodes are added or removed from other cluste
|-----------|--------| ----------------- | ---------- |
| [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |

-Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
+Most nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].

### Storage and file systems

@@ -71,7 +71,7 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy

Clariden uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

-There are two slurm partitions on the system:
+There are two Slurm partitions on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
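
For illustration, a minimal sketch of how an allocation in the `debug` partition might be requested with `srun`; the account name and node count are placeholders, not taken from the documentation:

```bash
# Illustrative only: one node in the debug partition for the 30-minute limit.
# Replace <account> with your project account.
srun --partition=debug --account=<account> --nodes=1 --time=00:30:00 --pty bash
```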
6 changes: 3 additions & 3 deletions docs/running/jobreport.md
@@ -56,7 +56,7 @@ The report is divided into two parts: a general summary and GPU specific values.
| Field | Description |
| ----- | ----------- |
| Job Id | The Slurm job id |
-| Step Id | The slurm step id. A job step in Slurm is a subdivision of a job started with srun |
+| Step Id | The Slurm step id. A job step in Slurm is a subdivision of a job started with srun |
| User | The user account that submitted the job |
| Slurm Account | The project account that will be billed |
| Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
@@ -77,7 +77,7 @@ The report is divided into two parts: a general summary and GPU specific values.
| SM Utilization % | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel |
| Memory Utilization % | The percentage of the process's lifetime during which global (device) memory was being read or written |

-## Example with slurm: srun
+## Example with Slurm: srun

The simplest example to test `jobreport` is to run it with the sleep command.
It is important to separate `jobreport` (and its options) and your command with `--`.
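
As a sketch of what such a test could look like (node and task counts are arbitrary):

```bash
# Everything after `--` is the command being monitored by jobreport.
srun --nodes=1 --ntasks=4 jobreport -- sleep 30
```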
@@ -155,7 +155,7 @@ GPU Specific Values
4. Uncheck "Set locale environment variables on startup"
5. Quit and reopen the terminal and try again. This should fix the issue.

-## Example with slurm: batch script
+## Example with Slurm: batch script

The `jobreport` command can be used in a batch script.
The report printing, too, can be included in the script and does not need the `srun` command.
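
A possible shape for such a batch script is sketched below; the `#SBATCH` values, the `jobreport print` invocation, and the `jobreport_<jobid>` directory name are assumptions for illustration, not taken verbatim from the documentation:

```bash
#!/bin/bash
#SBATCH --job-name=jobreport-demo   # illustrative values only
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Monitor the actual workload with jobreport.
srun jobreport -- sleep 60

# Print the report from within the same script, without srun.
# Assumes jobreport writes its data to a directory named jobreport_<jobid>.
jobreport print jobreport_${SLURM_JOB_ID}
```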
2 changes: 1 addition & 1 deletion docs/software/communication/openmpi.md
@@ -22,7 +22,7 @@ OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uen
Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
At runtime, some additional options must be set to correctly use the Slingshot network.

-First, when launching applications through slurm, [PMIx](https://pmix.github.com) must be used for application launching.
+First, when launching applications through Slurm, [PMIx](https://pmix.github.com) must be used for application launching.
This is done with the `--mpi` flag of `srun`:
```bash
srun --mpi=pmix ...
6 changes: 3 additions & 3 deletions docs/software/sciapps/cp2k.md
@@ -65,7 +65,7 @@ On our systems, CP2K is built with the following dependencies:

### Running on the HPC platform

-To start a job, two bash scripts are potentially required: a [slurm] submission script, and a wrapper to start the [CUDA
+To start a job, two bash scripts are potentially required: a [Slurm] submission script, and a wrapper to start the [CUDA
MPS] daemon so that multiple MPI ranks can use the same GPU.

```bash title="run_cp2k.sh"
@@ -138,7 +138,7 @@ sbatch run_cp2k.sh

Each GH200 node has 4 modules, each of them composed of an ARM Grace CPU with 72 cores and an H200 GPU directly
attached to it. Please see [Alps hardware][ref-alps-hardware] for more information.
-It is important that the number of MPI ranks passed to [slurm] with `--ntasks-per-node` is a multiple of 4.
+It is important that the number of MPI ranks passed to [Slurm] with `--ntasks-per-node` is a multiple of 4.
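
For instance (illustrative numbers only), a job spanning a single GH200 node could request:

```bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8   # a multiple of 4: two MPI ranks per Grace-Hopper module
```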

??? note

@@ -524,5 +524,5 @@ As a workaround, you can disable CUDA acceleration for the grid backend:
[OpenBLAS]: http://www.openmathlib.org/OpenBLAS/
[Intel MKL]: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
[Cray MPICH]: https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/
-[slurm]: https://slurm.schedmd.com/
+[Slurm]: https://slurm.schedmd.com/
[CUDA MPS]: https://docs.nvidia.com/deploy/mps/index.html
2 changes: 1 addition & 1 deletion docs/software/uenv/index.md
@@ -213,7 +213,7 @@ This is very useful for interactive sessions, for example if you want to work in
$ make -j

# run the affinity executable on two nodes - note how the uenv is
-# automatically loaded by slurm on the compute nodes, because CUDA and MPI from
+# automatically loaded by Slurm on the compute nodes, because CUDA and MPI from
# the uenv are required to run.
$ srun -n2 -N2 ./affinity.cuda
GPU affinity test for 2 MPI ranks