
Commit b687f63

update gb guide to reflect removal of LNM (#76)
1 parent 56195c9 commit b687f63


docs/guides/gb2025.md

Lines changed: 17 additions & 26 deletions
@@ -1,7 +1,7 @@
 [](){#ref-gb2025}
 # Gordon Bell and HPL runs 2025

-For Gordon Bell and HPL runs in March-April 2025, CSCS has created a reservation on Santis with 1333 nodes (12 cabinets).
+For Gordon Bell and HPL runs in March-April 2025, CSCS has expanded Santis to 1333 nodes (12 cabinets).

 For the runs, CSCS has applied some updates and changes that aim to improve performance and scaling, particularly for NCCL.
 If you are already familiar with running on Daint, you might have to make some small changes to your current job scripts and parameters, which will be documented here.
@@ -27,6 +27,18 @@ Host santis

 The `normal` partition is used with no reservation, which means that jobs can be submitted without `--partition` and `--reservation` flags.

+Timeline:
+
+1. Friday 4th April:
+    * HPE finishes HPL runs at 10:30am
+    * CSCS performs testing on the reconfigured system for ~1 hour on the `GB_TESTING_2` reservation
+    * The reservation is removed and all GB teams have access to test and tune applications.
+2. Monday 7th April:
+    * At 4pm the runs will start for the first team
+
+!!! note
+    There will be no special reservation during the open testing and tuning between Friday and Monday.
+
 ### Storage

 Your data sets from Daint are available on Santis
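
For illustration, a minimal batch-script sketch of the submission pattern described in the hunk above: the job runs on the default `normal` partition with no `--partition` or `--reservation` flags. The job name, node count, and time limit are placeholder values, not taken from the commit; `./build/affinity.mpi` is the test binary already referenced in the guide.

```bash
#!/bin/bash
#SBATCH --job-name=gb-test        # placeholder job name
#SBATCH --nodes=2                 # placeholder node count for a small test run
#SBATCH --ntasks-per-node=4       # 4 ranks per node, as in the guide's srun examples
#SBATCH --cpus-per-task=64        # cores per rank, as in the guide's SLURM section
#SBATCH --time=00:30:00           # placeholder time limit
# No --partition or --reservation flags are set, so the job goes to the
# default `normal` partition, as the updated guide describes.

srun ./build/affinity.mpi
```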
@@ -37,51 +49,30 @@ Your data sets from Daint are available on Santis

 ## Low Noise Mode

-Low noise mode (LNM) is now enabled.
-This confines system processes and operations to the first core of each of the four NUMA regions in a node (i.e., cores 0, 72, 144, 216).
-
-The consequence of this setting is that only 71 cores per socket can be requested by an application (for a total of 284 cores instead of 288 cores per node).
+!!! note
+    Low noise mode has been disabled, so the previous requirement that you set `OMP_PLACES` and `OMP_PROC_BIND` no longer applies.

 !!! warning "Unable to allocate resources: Requested node configuration is not available"
     If you try to use all 72 cores on each socket, SLURM will give a hard error, because only 71 are available:

     ```console
     # try to run 4 ranks per node, with 72 cores each
-    $ srun -n4 -N1 -c72 --reservation=reshuffling ./build/affinity.mpi
+    $ srun -n4 -N1 -c72 ./build/affinity.mpi
     srun: error: Unable to allocate resources: Requested node configuration is not available
     ```

-One consequence of this change is that thread affinity and OpenMP settings that worked on Daint might cause large slowdown in the new configuration.
-
 ### SLURM

 Explicitly set the number of cores per task using the `--cpus-per-task/-c` flag, e.g.:
+For example:
 ```
 #SBATCH --cpus-per-task=64
-#SBATCH --cpus-per-task=71
 ```
 or
 ```
 srun -N1 -n4 -c71 ...
 ```

-**Do not** use the `--cpu-bind` flag to control affinity
-
-* this can cause large slowdown, particularly with `--cpu-bind=socket`. We are investigating how to fix this.
-
-If you see significant slowdown and you want to report it, please provide the output of using the `--cpu-bind=verbose` flag.
-
-### OpenMP
-
-If your application uses OpenMP, try setting the following in your job script:
-
-```bash
-export OMP_PLACES=cores
-export OMP_PROC_BIND=close
-```
-
-Without these settings, we have observed application slowdown due to poor thread placement.
-
 ## NCCL

!!! todo
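
For illustration, a sketch combining the guide's two SLURM snippets (`#SBATCH --cpus-per-task` and `srun -N1 -n4 -c71 ...`) into one job script, assuming 4 ranks on a single node with 71 cores each; the job name and time limit are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=affinity-check   # placeholder job name
#SBATCH --nodes=1                   # single node, matching the guide's -N1 example
#SBATCH --ntasks-per-node=4         # 4 ranks per node, matching -n4
#SBATCH --cpus-per-task=71          # cores per rank, matching -c71 (4 x 71 = 284 of the node's 288 cores)
#SBATCH --time=00:10:00             # placeholder time limit

# Pass the same geometry explicitly to srun; the flags mirror the
# #SBATCH directives above.
srun -N1 -n4 -c71 ./build/affinity.mpi
```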
