diff --git a/docs/guides/gb2025.md b/docs/guides/gb2025.md
new file mode 100644
index 00000000..7a1ea414
--- /dev/null
+++ b/docs/guides/gb2025.md
@@ -0,0 +1,87 @@
+# Gordon Bell and HPL runs 2025
+
+For Gordon Bell and HPL runs in March-April 2025, CSCS has created a reservation on Santis with 1333 nodes (12 cabinets).
+
+For the runs, CSCS has applied some updates and changes that aim to improve performance and scaling, particularly for NCCL.
+If you are already familiar with running on Daint, you might have to make some small changes to your current job scripts and parameters, which are documented here.
+
+## Santis
+
+### Connecting
+
+Connecting to Santis via SSH is the same as for Daint and Clariden; see the [SSH guide][ref-ssh] for more information.
+
+Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Santis using `ssh santis`:
+```
+Host santis
+    HostName santis.alps.cscs.ch
+    ProxyJump ela
+# change cscsusername to your CSCS username
+    User cscsusername
+    IdentityFile ~/.ssh/cscs-key
+    IdentitiesOnly yes
+```
+
+### Reservations
+
+The `normal` partition is used with no reservation, which means that jobs can be submitted without the `--partition` and `--reservation` flags.
+
+### Storage
+
+Your data sets from Daint are available on Santis:
+
+* the same Home is shared between Daint, Clariden and Santis
+* the same Scratch is mounted on both Santis and Daint
+* Store/Project are also mounted
+
+## Low Noise Mode
+
+Low noise mode (LNM) is now enabled.
+This confines system processes and operations to the first core of each of the four NUMA regions in a node (i.e., cores 0, 72, 144, 216).
+
+The consequence of this setting is that only 71 cores per socket can be requested by an application (for a total of 284 instead of 288 cores per node).
+
+!!! warning "Unable to allocate resources: Requested node configuration is not available"
+    If you try to use all 72 cores on each socket, SLURM will give a hard error, because only 71 are available:
+
+    ```
+    # try to run 4 ranks per node, with 72 cores each
+    > srun -n4 -N1 -c72 --reservation=reshuffling ./build/affinity.mpi
+    srun: error: Unable to allocate resources: Requested node configuration is not available
+    ```
+
+One consequence of this change is that thread affinity and OpenMP settings that worked on Daint might cause large slowdowns in the new configuration.
+
+### SLURM
+
+Explicitly set the number of cores per task using the `--cpus-per-task/-c` flag, e.g.:
+```
+#SBATCH --cpus-per-task=64
+#SBATCH --cpus-per-task=71
+```
+or
+```
+srun -N1 -n4 -c71 ...
+```
+
+**Do not** use the `--cpu-bind` flag to control affinity:
+
+* this can cause large slowdowns, particularly with `--cpu-bind=socket`; we are investigating how to fix this.
+
+If you see a significant slowdown and want to report it, please include the output generated with the `--cpu-bind=verbose` flag.
+
+### OpenMP
+
+If your application uses OpenMP, try setting the following in your job script:
+
+```bash
+export OMP_PLACES=cores
+export OMP_PROC_BIND=close
+```
+
+Without these settings, we have observed application slowdowns due to poor thread placement. A complete example that combines the SLURM and OpenMP settings is given at the end of this page.
+
+## NCCL
+
+!!! todo
+    write a guide on which versions to use, environment variables to set, etc.
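+
+## Example job script
+
+The script below is a minimal sketch that combines the SLURM and OpenMP settings described above.
+The job name, node count, and the `./my_app` executable are placeholders that you should replace with values for your own runs.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=gb2025-example   # placeholder job name
+#SBATCH --nodes=2                   # placeholder node count
+#SBATCH --ntasks-per-node=4         # one rank per NUMA region (see Low Noise Mode above)
+#SBATCH --cpus-per-task=71          # only 71 of 72 cores per socket are available with LNM
+
+# pin OpenMP threads to cores, close to the parent rank
+export OMP_PLACES=cores
+export OMP_PROC_BIND=close
+export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
+
+# pass the core count explicitly to srun and do not set --cpu-bind
+srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./my_app
+```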
diff --git a/mkdocs.yml b/mkdocs.yml
index 4ff635a0..9c7d0324
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -97,6 +97,7 @@ nav:
     - guides/index.md
     - 'Internet Access on Alps': guides/internet-access.md
     - 'Storage': guides/storage.md
+    - 'Gordon Bell 2025': guides/gb2025.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md