GB2025 temporary docs (#219)

bcumming · web-flow · commit e2e644127ba6 · 2025-08-18T09:00:51.000+02:00
diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
@@ -0,0 +1,63 @@
+# Gordon Bell 2025
+
+This is temporary documentation for the Gordon Bell second-round benchmark runs scheduled for the week of August 18-22, 2025.
+
+## Schedule
+
+Times are in CEST (Central European Summer Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)
+
+| group  | date | time  | duration (h) | activity |
+| ---    | ---- | ----- | -----------: | --------------------- |
+| -      | 08-18| 08:00 | -            | Daint is reconfigured and resized for GB runs             |
+| all    | 08-18| ASAP  | -            | Daint is available to all teams for final testing at scale |
+| `g202` | 08-18| 21:00 | 2            | GB run |
+| `g199` | 08-18| 23:00 | 10           | GB run |
+| `g186` | 08-19| 09:00 | 6            | GB run |
+| `g200` | 08-19| 15:00 | 3            | GB run |
+| `g183` | 08-19| 18:00 | 24           | GB run |
+| `cwd01`| 08-20| 18:00 | 5            | GB run |
+| -      | 08-20| 23:00 | 9            | *free slot* |
+| `g188` | 08-21| 08:00 | 8            | GB run |
+| `cwd01`| 08-21| 16:00 | 5            | GB run |
+
+## System
+
+The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.
+
+* [Grace-Hopper nodes][ref-alps-gh200-node]
+* [Using Slurm with Grace-Hopper][ref-slurm-gh200]
+
+!!! todo "information about partition, account, time limits"
+
+```bash
+#!/bin/bash
+
+#SBATCH --account=<account>
+#SBATCH --partition=<todo>
+
+srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
+```
+
+## Tips
+
+### Improving job startup times
+
+In the first round of GB runs, we identified slow job startup as a common cause of crashes during launch.
+
+Working with HPE, we identified the most likely cause as file system contention while loading dynamic libraries before `main()` starts.
+
+The fix is to change the Lustre striping of the squashfs file for the uenv or container used by your job.
+
+```console title="set lustre striping on uenv squashfs file"
+$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
+/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
+$ lfs setstripe --stripe-count -1 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
+```
+
+As an additional precaution, we recommend increasing the wait threshold for `MPI_Init` from its default of 180 seconds to 300 seconds.
+```console title="increase MPI initialization time-out"
+$ export PMI_MMAP_SYNC_WAIT_TIME=300
+```
+
+!!! todo "update this with the final guidance"
+
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -130,6 +130,7 @@ nav:
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
       - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
       - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
+    - 'Gordon Bell 2025': guides/gb25.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md