diff --git a/docs/clusters/santis.md b/docs/clusters/santis.md index a3564e1f..73aafddf 100644 --- a/docs/clusters/santis.md +++ b/docs/clusters/santis.md @@ -7,10 +7,7 @@ Santis is an Alps cluster that provides GPU accelerators and file systems design ### Compute nodes -Santis consists of around 600 [Grace-Hopper nodes][ref-alps-gh200-node]. - -!!! note - In late March 2025 Santis was temporarily expanded to 1233 nodes for [Gordon Bell and HPL runs][ref-gb2025]. +Santis consists of around 430 [Grace-Hopper nodes][ref-alps-gh200-node]. The number of nodes can change when nodes are added or removed from other clusters on Alps. @@ -19,7 +16,7 @@ You will be assigned to one of the four login nodes when you ssh onto the system | node type | number of nodes | total CPU sockets | total GPUs | |-----------|-----------------| ----------------- | ---------- | -| [gh200][ref-alps-gh200-node] | 600 | 2,400 | 2,400 | +| [gh200][ref-alps-gh200-node] | 430 | 1,720 | 1,720 | ### Storage and file systems diff --git a/docs/guides/gb2025.md b/docs/guides/gb2025.md deleted file mode 100644 index 9b00b66b..00000000 --- a/docs/guides/gb2025.md +++ /dev/null @@ -1,87 +0,0 @@ -[](){#ref-gb2025} -# Gordon Bell and HPL runs 2025 - -For Gordon Bell and HPL runs in March-April 2025, CSCS has expanded Santis to 1333 nodes (12 cabinets). - -For the runs, CSCS has applied some updates and changes that aim to improve performance and scaling scale, particularly for NCCL. -If you are already familiar with running on Daint, you might have to make some small changes to your current job scripts and parameters, which will be documented here. - -## Santis - -### Connecting - -Connecting to Santis via SSH is the same as for Daint and Clariden, see the [ssh guide][ref-ssh] for more information. - -Add the following to your [SSH configuration][ref-ssh-config] to enable you to directly connect to Santis using `ssh santis`. -``` -Host santis - HostName santis.alps.cscs.ch - ProxyJump ela -# change cscsusername to your CSCS username - User cscsusername - IdentityFile ~/.ssh/cscs-key - IdentitiesOnly yes -``` - -### Reservations - -The `normal` partition is used with no reservation, which means that that jobs can be submittied without `--partition` and `--reservation` flags. - -Timeline: - -1. Friday 4th April: - * HPE finish HPL runs at 10:30am - * CSCS performs testing on the reconfigured system for ~1 hour on the `GB_TESTING_2` reservation - * The reservation is removed and all GB teams have access to test and tune applications. -2. Monday 7th April: - * at 4pm the runs will start for the first team - -!!! note - There will be no special reservation during the open testing and tuning between Friday and Monday. - -### Storage - -Your data sets from Daint are available on Santis - -* the same Home is shared between Daint, Clariden and Santis -* the same Scratch is mounted on both Santis and Daint -* Store/Project are also mounted. - -## Low Noise Mode - -!!! note - Low noise mode has been relaxed, so the previous requirement that you set `OMP_PLACES` and `OMP_PROC_BIND` no longer applies. - One core per module is still reserved for system processes. - -Santis uses low noise mode, which reserves one core per Grace-Hopper module (i.e. per 72 cores) for system processes. -This mode is intended to reduce performance variability caused by system processes interfering with application threads and processes. -This means that SLURM job scripts must be updated to account for the reserved cores. - -### SLURM - -!!! warning "Unable to allocate resources: Requested node configuration is not available" - If you try to use all 72 cores on each socket, SLURM will give a hard error, because only 71 are available: - - ```console - # try to run 4 ranks per node, with 72 cores each - $ srun -n4 -N1 -c72 ./build/affinity.mpi - srun: error: Unable to allocate resources: Requested node configuration is not available - ``` - -Explicitly set the number of cores per task using the `--cpus-per-task/-c` flag, e.g.: -``` -#SBATCH --cpus-per-task=64 -``` -or -``` -srun -N1 -n4 -c71 ... -``` - -## NCCL - -!!! todo - write a guide on which versions to use, environment variables to set, etc. - -See [the container engine documentation][ref-ce-aws-ofi-hook] for information on using NCCL in containers. -The [NCCL][ref-communication-nccl] contains general information on configuring NCCL. -This information is especially important when using uenvs, as the environment variables are not set automatically. diff --git a/mkdocs.yml b/mkdocs.yml index 5529d83a..b7879f9c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -96,7 +96,6 @@ nav: - 'Internet Access on Alps': guides/internet-access.md - 'Storage': guides/storage.md - 'Using the terminal': guides/terminal.md - - 'Gordon Bell 2025': guides/gb2025.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md