From 75a4b5f701827e8cf33af68b33b06e54fe8d958c Mon Sep 17 00:00:00 2001
From: bcumming
Date: Fri, 15 Aug 2025 11:45:19 +0200
Subject: [PATCH 1/6] draft gb docs

---
 docs/guides/gb25.md | 48 +++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml          |  1 +
 2 files changed, 49 insertions(+)
 create mode 100644 docs/guides/gb25.md

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
new file mode 100644
index 00000000..03738a74
--- /dev/null
+++ b/docs/guides/gb25.md
@@ -0,0 +1,48 @@
+# Gordon Bell 2025
+
+This is temporary documentation for the Gordon Bell second round benchmark runs scheduled for the week August 18-22 2025.
+
+## Schedule
+
+| group | date | time | activity |
+| --- | ---- | ----- | --------------------- |
+| - | 08-15| 08:00 | Daint is reconfigured and resized for GB runs |
+| all | 08-15| ASAP | Daint is available to all teams for final testing at scale |
+| `g???`| 08-15| 18:00 | Daint is available to all teams for final testing at scale |
+
+## System
+
+The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.
+
+* [Grace-Hopper nodes][ref-alps-gh200-node].
+* [using Slurm with Grace-Hopper][ref-slurm-gh200].
+
+!!! todo "information about partition, account, time limits"
+
+```bash
+#!/bin/bash
+
+#SBATCH --account=
+#SBATCH --partition=
+
+srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
+```
+
+## Tips
+
+### Improving job startup times
+
+In the first round of GB runs we identified slow job startup times as a common cause of crashes during job startup.
+
+With HPE we have identified that the most likely cause is file system contention loading dynamic libraries before `main()` starts.
+
+The fix is to update how the squashfs file for the uenv or container used by your job is stored on the filesystem.
+
+```console title="set lustre striping on uenv squashfs file"
+$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
+/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
+$ lfs setstripe --stripe-count -1 --stripe-size 4M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
+```
+
+!!! todo "update this with the final guidance"
+
diff --git a/mkdocs.yml b/mkdocs.yml
index 22d816e9..3ac6e5b6 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -129,6 +129,7 @@ nav:
 - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
 - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
 - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
+- 'Gordon Bell 2025': guides/gb25.md
 - 'Policies':
 - policies/index.md
 - 'User Regulations': policies/regulations.md

From 6a2fb0f1ade66888c89ae58f39664d413f24f0ed Mon Sep 17 00:00:00 2001
From: bcumming
Date: Fri, 15 Aug 2025 11:52:18 +0200
Subject: [PATCH 2/6] fix cut and paste in table

---
 docs/guides/gb25.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
index 03738a74..9c57ff3f 100644
--- a/docs/guides/gb25.md
+++ b/docs/guides/gb25.md
@@ -8,7 +8,7 @@ This is temporary documentation for the Gordon Bell second round benchmark runs
 | --- | ---- | ----- | --------------------- |
 | - | 08-15| 08:00 | Daint is reconfigured and resized for GB runs |
 | all | 08-15| ASAP | Daint is available to all teams for final testing at scale |
-| `g???`| 08-15| 18:00 | Daint is available to all teams for final testing at scale |
+| `g???`| 08-15| 18:00 | First team runs |
 
 ## System
 

From 6a73c561535fc3000581be0ab1d8a7e79c0ee416 Mon Sep 17 00:00:00 2001
From: Sebastian Keller
Date: Fri, 15 Aug 2025 12:36:13 +0200
Subject: [PATCH 3/6] increase default MPI_Init wait threshold

---
 docs/guides/gb25.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
index 9c57ff3f..adadbb8f 100644
--- a/docs/guides/gb25.md
+++ b/docs/guides/gb25.md
@@ -41,7 +41,12 @@ The fix is to update how the squashfs file for the uenv or container used by you
 ```console title="set lustre striping on uenv squashfs file"
 $ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
 /capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
-$ lfs setstripe --stripe-count -1 --stripe-size 4M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
+$ lfs setstripe --stripe-count -1 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
+```
+
+As an additional precaution, we recommend to increase the default wait threshold for `MPI_Init` from 180 seconds to 300.
+```console title="increase MPI initialization time-out"
+$ export PMI_MMAP_SYNC_WAIT_TIME=300
 ```
 
 !!! todo "update this with the final guidance"

From c36690269e4aeff948f5976c10cfd75fbbdab908 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Mon, 18 Aug 2025 08:38:51 +0200
Subject: [PATCH 4/6] update schedule for gb

---
 docs/guides/gb25.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
index 9c57ff3f..3e3dfe5c 100644
--- a/docs/guides/gb25.md
+++ b/docs/guides/gb25.md
@@ -4,11 +4,19 @@ This is temporary documentation for the Gordon Bell second round benchmark runs
 
 ## Schedule
 
-| group | date | time | activity |
-| --- | ---- | ----- | --------------------- |
-| - | 08-15| 08:00 | Daint is reconfigured and resized for GB runs |
-| all | 08-15| ASAP | Daint is available to all teams for final testing at scale |
-| `g???`| 08-15| 18:00 | First team runs |
+| group | date | time | duration (h) | activity |
+| --- | ---- | ----- | -----------: | --------------------- |
+| - | 08-18| 08:00 | - | Daint is reconfigured and resized for GB runs |
+| all | 08-18| ASAP | - | Daint is available to all teams for final testing at scale |
+| `g202` | 08-18| 21:00 | 2 | GB run |
+| `g199` | 08-18| 23:00 | 10 | GB run |
+| `g186` | 08-19| 09:00 | 6 | GB run |
+| `g200` | 08-19| 15:00 | 3 | GB run |
+| `g183` | 08-19| 18:00 | 24 | GB run |
+| `cwd01`| 08-20| 18:00 | 5 | GB run |
+| - | 08-20| 23:00 | 9 | *free slot* |
+| `g188` | 08-21| 08:00 | 8 | GB run |
+| `cwd01`| 08-21| 16:00 | 5 | GB run |
 
 ## System
 

From 00fea9adbc026d32d6a37d11321d25b9a9b2d8f8 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Mon, 18 Aug 2025 08:51:13 +0200
Subject: [PATCH 5/6] add schedule

---
 docs/guides/gb25.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
index e56a6005..43008ed0 100644
--- a/docs/guides/gb25.md
+++ b/docs/guides/gb25.md
@@ -4,6 +4,8 @@ This is temporary documentation for the Gordon Bell second round benchmark runs
 
 ## Schedule
 
+Times are in CEST (Central European Time): [Time converstion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)
+
 | group | date | time | duration (h) | activity |
 | --- | ---- | ----- | -----------: | --------------------- |
 | - | 08-18| 08:00 | - | Daint is reconfigured and resized for GB runs |

From 60cb27f9715efd09617afa3ab4d3426a0c7b3031 Mon Sep 17 00:00:00 2001
From: bcumming
Date: Mon, 18 Aug 2025 08:53:12 +0200
Subject: [PATCH 6/6] typo

---
 docs/guides/gb25.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guides/gb25.md b/docs/guides/gb25.md
index 43008ed0..38f9ce22 100644
--- a/docs/guides/gb25.md
+++ b/docs/guides/gb25.md
@@ -4,7 +4,7 @@ This is temporary documentation for the Gordon Bell second round benchmark runs
 
 ## Schedule
 
-Times are in CEST (Central European Time): [Time converstion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)
+Times are in CEST (Central European Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)
 
 | group | date | time | duration (h) | activity |
 | --- | ---- | ----- | -----------: | --------------------- |