Commit e2e6441

GB2025 temporary docs (#219)
1 parent 742319d commit e2e6441

2 files changed: +64 -0 lines changed

docs/guides/gb25.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Gordon Bell 2025

This is temporary documentation for the Gordon Bell second-round benchmark runs scheduled for the week of August 18-22, 2025.

## Schedule

Times are in CEST (Central European Summer Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)

| group   | date  | time  | duration (h) | activity |
| ------- | ----- | ----- | -----------: | -------- |
| -       | 08-18 | 08:00 |            - | Daint is reconfigured and resized for GB runs |
| all     | 08-18 | ASAP  |            - | Daint is available to all teams for final testing at scale |
| `g202`  | 08-18 | 21:00 |            2 | GB run |
| `g199`  | 08-18 | 23:00 |           10 | GB run |
| `g186`  | 08-19 | 09:00 |            6 | GB run |
| `g200`  | 08-19 | 15:00 |            3 | GB run |
| `g183`  | 08-19 | 18:00 |           24 | GB run |
| `cwd01` | 08-20 | 18:00 |            5 | GB run |
| -       | 08-20 | 23:00 |            9 | *free slot* |
| `g188`  | 08-21 | 08:00 |            8 | GB run |
| `cwd01` | 08-21 | 16:00 |            5 | GB run |

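The slots in the schedule above run back to back, so each slot's end time is simply its start time plus its duration. A minimal sketch of that arithmetic follows, assuming GNU `date` is available. The helper `slot_end` is a hypothetical name introduced here for illustration. It treats the table entries as plain local (CEST) wall-clock times and does the arithmetic in UTC only so that the result does not depend on the time zone of the machine running it.

```bash
#!/bin/bash
# Hypothetical helper: print the end of a slot as "MM-DD HH:MM".
# Arguments: date (MM-DD), start time (HH:MM), duration in hours.
# Assumes GNU date and the year 2025, as in the schedule.
slot_end() {
    local day="$1" start="$2" hours="$3"
    local epoch
    # Parse the wall-clock time as if it were UTC to get a stable epoch.
    epoch=$(date -u -d "2025-${day} ${start}" +%s) || return 1
    # Add the duration in seconds and format the result the same way.
    date -u -d "@$((epoch + hours * 3600))" '+%m-%d %H:%M'
}

slot_end 08-18 21:00 2    # g202
slot_end 08-18 23:00 10   # g199
slot_end 08-19 18:00 24   # g183
```

Each printed end time matches the start of the next slot in the table, for example `08-19 09:00` for `g199`, which is when `g186` begins.
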
## System

The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.

* [Grace-Hopper nodes][ref-alps-gh200-node].
* [using Slurm with Grace-Hopper][ref-slurm-gh200].

!!! todo "information about partition, account, time limits"

```bash
#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=<todo>

srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
```

## Tips

### Improving job startup times

In the first round of GB runs, we identified slow job startup as a common cause of crashes during job launch.

Working with HPE, we have identified the most likely cause as file system contention when loading dynamic libraries before `main()` starts.

The fix is to change how the squashfs file for the uenv or container used by your job is stored on the file system, by setting its Lustre striping.

```console title="set lustre striping on uenv squashfs file"
$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
$ lfs setstripe --stripe-count -1 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
```

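Because a file's Lustre layout is normally fixed when the file is first written, it may be worth confirming that the squashfs file actually has the intended striping before a run at scale. The sketch below is one possible check, not official guidance. The function name `check_striping` is hypothetical, and the script only reads the layout with `lfs getstripe`. It changes nothing.

```bash
#!/bin/bash
# Hypothetical helper: report the Lustre stripe count and stripe size
# of a squashfs image. Read-only; it does not modify the file.
check_striping() {
    local sqfs="$1"
    if [ ! -f "$sqfs" ]; then
        echo "no such file: $sqfs" >&2
        return 1
    fi
    if ! command -v lfs >/dev/null 2>&1; then
        echo "lfs not found: run this where the Lustre client tools are installed" >&2
        return 2
    fi
    echo "stripe count: $(lfs getstripe --stripe-count "$sqfs")"
    echo "stripe size:  $(lfs getstripe --stripe-size "$sqfs")"
}

# Example use on the system (commented out so the sketch has no side effects):
# check_striping "$(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')"
```

If the reported stripe count is still 1, the layout was probably not updated and the file may need to be rewritten or re-pulled after setting the striping.
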
As an additional precaution, we recommend increasing the default wait threshold for `MPI_Init` from 180 seconds to 300 seconds.

```console title="increase MPI initialization time-out"
$ export PMI_MMAP_SYNC_WAIT_TIME=300
```

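One way to apply this time-out is to set it in the batch script before the `srun` line, since Slurm passes the job script's environment on to the job steps by default. The sketch below uses a hypothetical helper, `ensure_pmi_wait`, that is not part of any existing tool. It raises the value to 300 only when the variable is unset or lower, so a larger value chosen elsewhere is kept.

```bash
#!/bin/bash
# Hypothetical helper: make sure PMI_MMAP_SYNC_WAIT_TIME is at least
# the given number of seconds (default 300). Assumes any existing
# value is a plain integer.
ensure_pmi_wait() {
    local minimum="${1:-300}"
    local current="${PMI_MMAP_SYNC_WAIT_TIME:-0}"
    if [ "$current" -lt "$minimum" ]; then
        export PMI_MMAP_SYNC_WAIT_TIME="$minimum"
    fi
}

ensure_pmi_wait 300
echo "PMI_MMAP_SYNC_WAIT_TIME=${PMI_MMAP_SYNC_WAIT_TIME}"
```
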
!!! todo "update this with the final guidance"

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -130,6 +130,7 @@ nav:
  - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
  - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
  - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
+ - 'Gordon Bell 2025': guides/gb25.md
  - 'Policies':
  - policies/index.md
  - 'User Regulations': policies/regulations.md
