Commit 75a4b5f

draft gb docs
1 parent 0dc28dc commit 75a4b5f

2 files changed: +49 -0 lines changed

docs/guides/gb25.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Gordon Bell 2025

This is temporary documentation for the Gordon Bell second-round benchmark runs scheduled for the week of August 18-22, 2025.

## Schedule

| group | date | time | activity |
| ----- | ----- | ----- | -------- |
| - | 08-15 | 08:00 | Daint is reconfigured and resized for GB runs |
| all | 08-15 | ASAP | Daint is available to all teams for final testing at scale |
| `g???` | 08-15 | 18:00 | Daint is available to all teams for final testing at scale |

## System

The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.

* [Grace-Hopper nodes][ref-alps-gh200-node].
* [using Slurm with Grace-Hopper][ref-slurm-gh200].

!!! todo "information about partition, account, time limits"

```bash
#!/bin/bash

# Placeholders: replace <account>, <todo>, and the ? values before submitting.
#SBATCH --account=<account>
#SBATCH --partition=<todo>

srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
```
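
When filling in the `-n` and `-N` placeholders above, the total task count is usually the node count multiplied by the number of tasks per node. The following is a minimal sketch of that arithmetic. It assumes one rank per GPU and four GPUs per Grace-Hopper node, which should be checked against the node documentation linked above. The function name `total_tasks` is hypothetical and is not part of Slurm or uenv.

```bash
#!/bin/bash

# Hypothetical helper: total number of tasks to pass to srun's -n,
# given the node count (-N) and the number of tasks per node.
total_tasks() {
    local nodes="$1"
    # Assumption: one rank per GPU and 4 GPUs per node unless overridden.
    local tasks_per_node="${2:-4}"
    echo $(( nodes * tasks_per_node ))
}

# Example: 2350 nodes with 4 tasks per node.
total_tasks 2350 4   # prints 9400
```

The result can be passed straight to `srun`, for example `srun -N "$nodes" -n "$(total_tasks "$nodes")" ...`.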

## Tips

### Improving job startup times

In the first round of GB runs, we identified slow job startup as a common cause of crashes at launch.

Working with HPE, we identified the most likely cause: file system contention while dynamic libraries are loaded, before `main()` starts.

The fix is to change how the squashfs file for the uenv or container used by your job is striped on the Lustre file system, so that reads from many nodes are spread across all storage targets.

```console title="Set Lustre striping on uenv squashfs file"
$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
$ lfs setstripe --stripe-count -1 --stripe-size 4M "$(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')"
```
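
The striping step above can be wrapped in a small helper that applies the settings and then prints the resulting layout for inspection. This is a minimal sketch, not final guidance. The function name `stripe_squashfs` and the `DRY_RUN` variable are hypothetical. It assumes that the `lfs` client tool is on the `PATH` and that the file lives on a Lustre file system.

```bash
#!/bin/bash

# Hypothetical helper: apply the striping settings shown above to one file.
# With DRY_RUN=1 the setstripe command is printed instead of executed.
stripe_squashfs() {
    local sqfs="$1"
    if [ -z "$sqfs" ]; then
        echo "usage: stripe_squashfs <path-to-squashfs>" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "lfs setstripe --stripe-count -1 --stripe-size 4M $sqfs"
        return 0
    fi
    # A stripe count of -1 stripes the file across all available OSTs.
    lfs setstripe --stripe-count -1 --stripe-size 4M "$sqfs" || return 1
    # Print the resulting layout so that it can be checked by eye.
    lfs getstripe "$sqfs"
}
```

For the image used above, this could be called as `stripe_squashfs "$(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')"`. Setting `DRY_RUN=1` first prints the command without running it.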

!!! todo "update this with the final guidance"

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -129,6 +129,7 @@ nav:
  - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
  - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
  - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
+ - 'Gordon Bell 2025': guides/gb25.md
  - 'Policies':
  - policies/index.md
  - 'User Regulations': policies/regulations.md
