
Commit f97fa33

document affinity and openmp fixes for GB runs (#60)
1 parent 2e7cdf4 commit f97fa33

2 files changed

+88
-0
lines changed

docs/guides/gb2025.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
1+
# Gordon Bell and HPL runs 2025
2+
3+
For Gordon Bell and HPL runs in March-April 2025, CSCS has created a reservation on Santis with 1333 nodes (12 cabinets).
4+
5+
For the runs, CSCS has applied some updates and changes that aim to improve performance and scaling, particularly for NCCL.
6+
If you are already familiar with running on Daint, you might have to make some small changes to your current job scripts and parameters, which will be documented here.
7+
8+
## Santis
9+
10+
### Connecting
11+
12+
Connecting to Santis via SSH is the same as for Daint and Clariden, see the [ssh guide][ref-ssh] for more information.
13+
14+
Add the following to your [SSH configuration][ref-ssh-config] to enable you to directly connect to Santis using `ssh santis`.
15+
```
Host santis
    HostName santis.alps.cscs.ch
    ProxyJump ela
    # change cscsusername to your CSCS username
    User cscsusername
    IdentityFile ~/.ssh/cscs-key
    IdentitiesOnly yes
```
24+
25+
### Reservations
26+
27+
The `normal` partition is used with no reservation, which means that jobs can be submitted without the `--partition` and `--reservation` flags.
28+
29+
### Storage
30+
31+
Your data sets from Daint are available on Santis:
32+
33+
* the same Home is shared between Daint, Clariden and Santis
34+
* the same Scratch is mounted on both Santis and Daint
35+
* Store/Project are also mounted.
36+
37+
## Low Noise Mode
38+
39+
Low noise mode (LNM) is now enabled.
40+
This confines system processes and operations to the first core of each of the four NUMA regions in a node (i.e., cores 0, 72, 144, 216).
41+
42+
The consequence of this setting is that only 71 cores per socket can be requested by an application (for a total of 284 cores instead of 288 cores per node).
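
The core counts above follow from simple arithmetic, sketched below in shell. The variable names are illustrative only; the values (four NUMA regions of 72 cores, with the first core of each region reserved) are taken from the description above.

```shell
#!/usr/bin/env bash
# Sketch: per-node core arithmetic under low noise mode (LNM).
# Values come from the text above; variable names are illustrative.
cores_per_region=72      # cores in each NUMA region
regions=4                # NUMA regions per node
reserved_per_region=1    # first core of each region is kept for system processes

usable_per_region=$(( cores_per_region - reserved_per_region ))
usable_per_node=$(( usable_per_region * regions ))

# The reserved core IDs are the first core of each region.
reserved_cores=""
for (( r = 0; r < regions; r++ )); do
    reserved_cores+="$(( r * cores_per_region )) "
done
reserved_cores="${reserved_cores% }"

echo "reserved cores: ${reserved_cores}"
echo "usable cores per region: ${usable_per_region}"
echo "usable cores per node: ${usable_per_node}"
```

Any request for more than 71 cores per task, with four tasks per node, exceeds what this arithmetic allows.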
43+
44+
!!! warning "Unable to allocate resources: Requested node configuration is not available"
45+
If you try to use all 72 cores on each socket, SLURM will give a hard error, because only 71 are available:
46+
47+
```
# try to run 4 ranks per node, with 72 cores each
> srun -n4 -N1 -c72 --reservation=reshuffling ./build/affinity.mpi
srun: error: Unable to allocate resources: Requested node configuration is not available
```
52+
53+
Another consequence of this change is that thread affinity and OpenMP settings that worked on Daint might cause a large slowdown in the new configuration.
54+
55+
### SLURM
56+
57+
Explicitly set the number of cores per task using the `--cpus-per-task/-c` flag, e.g.:
58+
```
# choose one value, e.g. 64 or 71 (at most 71 with four tasks per node)
#SBATCH --cpus-per-task=64
#SBATCH --cpus-per-task=71
```
62+
or
63+
```
srun -N1 -n4 -c71 ...
```
66+
67+
**Do not** use the `--cpu-bind` flag to control affinity
68+
69+
* this can cause a large slowdown, particularly with `--cpu-bind=socket`. We are investigating how to fix this.
70+
71+
If you see a significant slowdown and want to report it, please provide the output of a run with the `--cpu-bind=verbose` flag.
72+
73+
### OpenMP
74+
75+
If your application uses OpenMP, try setting the following in your job script:
76+
77+
```bash
# standard OpenMP environment variables use the OMP_ prefix
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```
81+
82+
Without these settings, we have observed application slowdown due to poor thread placement.
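
As a rough illustration, the SLURM and OpenMP recommendations above could be combined in one job script along the following lines. This is a sketch under assumptions, not a tested CSCS recipe: it uses the standard OpenMP variable names `OMP_PLACES` and `OMP_PROC_BIND`, the use of `OMP_NUM_THREADS` tied to `SLURM_CPUS_PER_TASK` is our addition, and `./my_app` is a hypothetical application name.

```shell
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=71

# Assumption: one OpenMP thread per allocated core.
# Falls back to 71 when the script is run outside of SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-71}"

# Thread placement settings recommended in this guide.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

echo "threads=${OMP_NUM_THREADS} places=${OMP_PLACES} bind=${OMP_PROC_BIND}"

# Hypothetical launch line; pass -c explicitly and do not add --cpu-bind.
# srun -c "${OMP_NUM_THREADS}" ./my_app
```

The launch line is left commented out so that the script can be checked without an allocation; uncomment it and substitute your own executable.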
83+
84+
## NCCL
85+
86+
!!! todo
87+
write a guide on which versions to use, environment variables to set, etc.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -97,6 +97,7 @@ nav:
9797
- guides/index.md
9898
- 'Internet Access on Alps': guides/internet-access.md
9999
- 'Storage': guides/storage.md
100+
- 'Gordon Bell 2025': guides/gb2025.md
100101
- 'Policies':
101102
- policies/index.md
102103
- 'User Regulations': policies/regulations.md
