diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
new file mode 100644
index 00000000..38f9ce22
--- /dev/null
+++ b/docs/guides/gb25.md
@@ -0,0 +1,63 @@
+# Gordon Bell 2025
+
+This is temporary documentation for the Gordon Bell second-round benchmark runs scheduled for the week of August 18-22, 2025.
+
+## Schedule
+
+Times are in CEST (Central European Summer Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)
+
+| group | date | time | duration (h) | activity |
+| --- | ---- | ----- | -----------: | --------------------- |
+| - | 08-18| 08:00 | - | Daint is reconfigured and resized for GB runs |
+| all | 08-18| ASAP | - | Daint is available to all teams for final testing at scale |
+| `g202` | 08-18| 21:00 | 2 | GB run |
+| `g199` | 08-18| 23:00 | 10 | GB run |
+| `g186` | 08-19| 09:00 | 6 | GB run |
+| `g200` | 08-19| 15:00 | 3 | GB run |
+| `g183` | 08-19| 18:00 | 24 | GB run |
+| `cwd01`| 08-20| 18:00 | 5 | GB run |
+| - | 08-20| 23:00 | 9 | *free slot* |
+| `g188` | 08-21| 08:00 | 8 | GB run |
+| `cwd01`| 08-21| 16:00 | 5 | GB run |
+
+## System
+
+The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.
+
+* [Grace-Hopper nodes][ref-alps-gh200-node].
+* [Using Slurm with Grace-Hopper][ref-slurm-gh200].
+
+!!! todo "information about partition, account, time limits"
+
+```bash
+#!/bin/bash
+
+#SBATCH --account=
+#SBATCH --partition=
+
+srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
+```
+
+## Tips
+
+### Improving job startup times
+
+In the first round of GB runs, we found that slow job startup was a common cause of crashes at launch.
+
+Working with HPE, we identified the most likely cause: file system contention while loading dynamic libraries before `main()` starts.
+
+The fix is to change how the squashfs file for the uenv or container used by your job is stored on the file system, by setting its Lustre striping:
+
+```console title="set Lustre striping on uenv squashfs file"
+$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
+/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
+$ lfs setstripe --stripe-count -1 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
+```
+
+As an additional precaution, we recommend increasing the wait threshold for `MPI_Init` from its default of 180 seconds to 300 seconds.
+```console title="increase MPI initialization time-out"
+$ export PMI_MMAP_SYNC_WAIT_TIME=300
+```
+
+!!! todo "update this with the final guidance"
+
diff --git a/mkdocs.yml b/mkdocs.yml
index 56777a14..c1167de4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -130,6 +130,7 @@ nav:
 - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
 - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
 - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
+ - 'Gordon Bell 2025': guides/gb25.md
 - 'Policies':
 - policies/index.md
 - 'User Regulations': policies/regulations.md