Merged
63 changes: 63 additions & 0 deletions docs/guides/gb25.md
@@ -0,0 +1,63 @@
# Gordon Bell 2025

This is temporary documentation for the second-round Gordon Bell benchmark runs scheduled for the week of August 18-22, 2025.

## Schedule

Times are in CEST (Central European Summer Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)

| group | date | time | duration (h) | activity |
| --- | ---- | ----- | -----------: | --------------------- |
| - | 08-18| 08:00 | - | Daint is reconfigured and resized for GB runs |
| all | 08-18| ASAP | - | Daint is available to all teams for final testing at scale |
| `g202` | 08-18| 21:00 | 2 | GB run |
| `g199` | 08-18| 23:00 | 10 | GB run |
| `g186` | 08-19| 09:00 | 6 | GB run |
| `g200` | 08-19| 15:00 | 3 | GB run |
| `g183` | 08-19| 18:00 | 24 | GB run |
| `cwd01`| 08-20| 18:00 | 5 | GB run |
| - | 08-20| 23:00 | 9 | *free slot* |
| `g188` | 08-21| 08:00 | 8 | GB run |
| `cwd01`| 08-21| 16:00 | 5 | GB run |

## System

The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.

* [Grace-Hopper nodes][ref-alps-gh200-node]
* [Using Slurm with Grace-Hopper][ref-slurm-gh200]

!!! todo "information about partition, account, time limits"

```bash
#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=<todo>

srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
```

## Tips

### Improving job startup times

In the first round of GB runs, slow job startup was a common cause of crashes while jobs were launching.

Together with HPE, we have identified the most likely cause: file system contention when loading dynamic libraries before `main()` starts.

The fix is to change how the squashfs file of the uenv or container used by your job is stored on the file system: stripe it across all available storage targets (`--stripe-count -1`) with a 1 MiB stripe size, so that reads during startup are spread over many targets.

```console title="set lustre striping on uenv squashfs file"
$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
$ lfs setstripe --stripe-count -1 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
```
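
To confirm that the new layout was applied, the stripe count of the squashfs file can be queried with `lfs getstripe`. The following is a minimal sketch, not official guidance: the helper name `stripe_count` is hypothetical, and it assumes the Lustre client tool `lfs` is on the `PATH` of the node where it is run.

```bash
#!/bin/bash

# Hypothetical helper (not part of uenv or Lustre): print the stripe count
# of a file stored on a Lustre file system.
stripe_count() {
    local file="$1"

    # A path argument is required.
    if [ -z "$file" ]; then
        echo "usage: stripe_count <file>" >&2
        return 2
    fi

    # Assumption: the Lustre client tool `lfs` is installed on this node.
    if ! command -v lfs >/dev/null 2>&1; then
        echo "lfs not found in PATH" >&2
        return 1
    fi

    # `lfs getstripe --stripe-count` prints only the stripe count of the file.
    lfs getstripe --stripe-count "$file"
}
```

For example, `stripe_count "$(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')"` reports the count for the image used above. A value greater than 1 indicates that the file is spread over several storage targets.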

As an additional precaution, we recommend increasing the wait threshold for `MPI_Init` from its default of 180 seconds to 300 seconds.
```console title="increase MPI initialization time-out"
$ export PMI_MMAP_SYNC_WAIT_TIME=300
```
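
In a batch script, the variable has to be exported before the `srun` line so that the launched ranks inherit it. The sketch below is one possible way to do this without lowering a larger value that may already be set in the environment. The helper name `ensure_pmi_wait` is hypothetical, and the snippet assumes that any existing value is a plain integer number of seconds.

```bash
#!/bin/bash

# Hypothetical helper: make sure PMI_MMAP_SYNC_WAIT_TIME is at least
# the requested number of seconds (default 300).
ensure_pmi_wait() {
    local min="${1:-300}"
    # Treat an unset variable as 0 so that the minimum is always applied.
    local cur="${PMI_MMAP_SYNC_WAIT_TIME:-0}"

    if [ "$cur" -lt "$min" ]; then
        export PMI_MMAP_SYNC_WAIT_TIME="$min"
    fi
}

# Call this before srun in the job script.
ensure_pmi_wait 300
echo "PMI_MMAP_SYNC_WAIT_TIME=${PMI_MMAP_SYNC_WAIT_TIME}"
```

Keeping a larger pre-existing value is a design choice of this sketch; a plain `export PMI_MMAP_SYNC_WAIT_TIME=300` at the top of the job script is equally valid.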

!!! todo "update this with the final guidance"

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -130,6 +130,7 @@ nav:
- 'LLM Inference': guides/mlp_tutorials/llm-inference.md
- 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
- 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
- 'Gordon Bell 2025': guides/gb25.md
- 'Policies':
- policies/index.md
- 'User Regulations': policies/regulations.md