docs/guides/gb2025.md (new file, 87 additions)
# Gordon Bell and HPL runs 2025

For Gordon Bell and HPL runs in March-April 2025, CSCS has created a reservation on Santis with 1333 nodes (12 cabinets).

For the runs, CSCS has applied updates and changes that aim to improve performance and scaling, particularly for NCCL.
If you are already familiar with running on Daint, you might have to make some small changes to your job scripts and parameters; these changes are documented here.

## Santis

### Connecting

Connecting to Santis via SSH works the same way as for Daint and Clariden; see the [ssh guide][ref-ssh] for more information.

Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Santis with `ssh santis`:
```
Host santis
HostName santis.alps.cscs.ch
ProxyJump ela
# change cscsusername to your CSCS username
User cscsusername
IdentityFile ~/.ssh/cscs-key
IdentitiesOnly yes
```

### Reservations

The `normal` partition is used with no reservation, which means that jobs can be submitted without the `--partition` and `--reservation` flags.
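
For example, a short test launch needs neither flag (the binary name here is just a placeholder):

```bash
# no --partition or --reservation needed: the normal partition is the default
srun -N2 -n8 ./my_app
```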

### Storage

Your data sets from Daint are available on Santis:

* the same Home is shared between Daint, Clariden and Santis
* the same Scratch is mounted on both Santis and Daint
* Store/Project are also mounted.
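
As a quick sanity check that the shared filesystems are visible, something like the following can be run on both systems (a sketch; it assumes the usual `$HOME` and `$SCRATCH` environment variables are defined):

```bash
# list the shared home and scratch directories
echo "$HOME"    && ls "$HOME"
echo "$SCRATCH" && ls "$SCRATCH"
```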

## Low Noise Mode

Low noise mode (LNM) is now enabled.
This confines system processes and operations to the first core of each of the four NUMA regions in a node (i.e., cores 0, 72, 144, 216).

As a consequence, an application can request only 71 cores per socket, for a total of 284 cores per node instead of 288.
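
A quick way to inspect this from inside a job allocation (a sketch; it assumes `lscpu` and `numactl` are available on the compute nodes):

```bash
# show the NUMA layout of a compute node: expect 4 NUMA nodes with 72 cores each
srun -N1 -n1 lscpu | grep -i numa
# show which cores a task requesting 71 cores is actually given
srun -N1 -n1 -c71 numactl --show
```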

!!! warning "Unable to allocate resources: Requested node configuration is not available"
If you try to use all 72 cores on each socket, SLURM will give a hard error, because only 71 are available:

```
# try to run 4 ranks per node, with 72 cores each
> srun -n4 -N1 -c72 --reservation=reshuffling ./build/affinity.mpi
srun: error: Unable to allocate resources: Requested node configuration is not available
```

One consequence of this change is that thread affinity and OpenMP settings that worked on Daint might cause large slowdowns in the new configuration.

### SLURM

Explicitly set the number of cores per task using the `--cpus-per-task/-c` flag, with a value of at most 71, e.g.:
```
#SBATCH --cpus-per-task=64
#SBATCH --cpus-per-task=71
```
or
```
srun -N1 -n4 -c71 ...
```
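
Putting this together, a job script for 4 ranks per node under low noise mode might look like the following sketch (node count, time limit, and binary name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=16            # placeholder node count
#SBATCH --ntasks-per-node=4   # one rank per Grace CPU / NUMA region
#SBATCH --cpus-per-task=71    # at most 71 usable cores per socket under LNM
#SBATCH --time=01:00:00       # placeholder time limit

srun ./my_app                 # placeholder application binary
```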

**Do not** use the `--cpu-bind` flag to control affinity:

* it can cause large slowdowns, particularly with `--cpu-bind=socket`. We are investigating how to fix this.

If you see a significant slowdown and want to report it, please include the output generated with the `--cpu-bind=verbose` flag.
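
For example (the binary name is a placeholder):

```bash
# print the CPU binding applied by SLURM without overriding it
srun --cpu-bind=verbose -N1 -n4 -c71 ./my_app
```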

### OpenMP

If your application uses OpenMP, try setting the following in your job script:

```bash
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```

Without these settings, we have observed application slowdowns due to poor thread placement.
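
As a sketch, a hybrid MPI+OpenMP launch with 4 ranks per node could combine these settings as follows (the thread count and binary name are placeholders):

```bash
export OMP_NUM_THREADS=71   # one thread per usable core on each socket
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun -N1 -n4 -c71 ./my_hybrid_app   # placeholder binary
```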

## NCCL

!!! todo
write a guide on which versions to use, environment variables to set, etc.
mkdocs.yml (1 addition)
@@ -97,6 +97,7 @@ nav:
- guides/index.md
- 'Internet Access on Alps': guides/internet-access.md
- 'Storage': guides/storage.md
- 'Gordon Bell 2025': guides/gb2025.md
- 'Policies':
- policies/index.md
- 'User Regulations': policies/regulations.md