Commit a0fc65a

Merge branch 'main' into mlp-tutorials-update-iii
2 parents 1418931 + 522e297 commit a0fc65a

4 files changed: +79 -4 lines changed

docs/guides/gb25.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Gordon Bell 2025

This is temporary documentation for the Gordon Bell second-round benchmark runs scheduled for the week of August 18-22, 2025.

## Schedule

Times are in CEST (Central European Summer Time): [Time conversion table](https://time.is/compare/0800_18_Aug_2025_in_CET/PT/ET/Perth)

| group   | date  | time  | duration (h) | activity |
| ------- | ----- | ----- | -----------: | -------- |
| -       | 08-18 | 08:00 |            - | Daint is reconfigured and resized for GB runs |
| all     | 08-18 | ASAP  |            - | Daint is available to all teams for final testing at scale |
| `g202`  | 08-18 | 21:00 |            2 | GB run |
| `g199`  | 08-18 | 23:00 |           10 | GB run |
| `g186`  | 08-19 | 09:00 |            6 | GB run |
| `g200`  | 08-19 | 15:00 |            3 | GB run |
| `g183`  | 08-19 | 18:00 |           24 | GB run |
| `cwd01` | 08-20 | 18:00 |            5 | GB run |
| -       | 08-20 | 23:00 |            9 | *free slot* |
| `g188`  | 08-21 | 08:00 |            8 | GB run |

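Each GB run slot in the table begins when the previous one ends, so the end of a slot follows from its start time and duration. A minimal bash sketch of that arithmetic is shown here; `slot_end` is a hypothetical helper, not part of any CSCS tooling, and it reports times in CEST as in the table.

```bash
#!/bin/bash
# Hypothetical helper: print the end of a slot as "+<days>d HH:MM" (CEST),
# given its start time and its duration in hours from the schedule table.
slot_end() {
    local start="$1" hours="$2"
    local hh="${start%%:*}" mm="${start##*:}"
    # 10# forces base 10 so that "08" and "09" are not read as octal.
    local total=$(( 10#$hh + hours ))
    printf '+%dd %02d:%s\n' "$(( total / 24 ))" "$(( total % 24 ))" "$mm"
}

slot_end 23:00 10   # g199: ends the next day at 09:00, when g186 starts
slot_end 18:00 24   # g183: ends the next day at 18:00, when cwd01 starts
```

The day offset is relative to the date in the slot's own row.
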
## System

The system [Daint][ref-cluster-daint] will be expanded to approximately 2350 Grace-Hopper nodes.

* [Grace-Hopper nodes][ref-alps-gh200-node]
* [Using Slurm with Grace-Hopper][ref-slurm-gh200]

!!! todo "information about partition, account, time limits"

```bash
#!/bin/bash

# Replace <group> with your group name from the schedule.
#SBATCH --account=<group>
#SBATCH --partition=normal
#SBATCH --reservation=<group>

# Set the total task count (-n) and node count (-N) for your run.
srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
```

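To avoid hand-editing the template for each run, the placeholders can be filled in by a small generator. This is only a sketch: `make_jobscript` and its arguments are hypothetical, the reservation name is assumed to equal the group name as in the template above, and the group, node count, task count and `./my_app` in the usage line are illustrative values, not recommendations.

```bash
#!/bin/bash
# Hypothetical helper: print a batch script for GROUP that launches the
# given command on NODES nodes with TASKS tasks in total.
make_jobscript() {
    local group="$1" nodes="$2" tasks="$3"
    shift 3
    cat <<EOF
#!/bin/bash

#SBATCH --account=${group}
#SBATCH --partition=normal
#SBATCH --reservation=${group}

srun --uenv=prgenv-gnu/24.11:v2 --view=default -n ${tasks} -N ${nodes} $*
EOF
}

# Illustrative values only: group g202, 8 nodes, 32 tasks, placeholder binary.
make_jobscript g202 8 32 ./my_app
```

Redirect the output to a file and submit that file with `sbatch`.
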
## Recommendations on run configuration

### Disabling core-dumps

If a large job crashes and thousands of processes try to write core-dump files at once,
the filesystem will be overwhelmed. We therefore strongly recommend disabling core dumps with
the following command:

```console title="disable writing of core-dump files"
$ ulimit -S -c 0
```

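In a batch script, the limit is typically set before the `srun` line. The sketch below assumes Slurm's default behaviour of propagating the calling shell's resource limits to the job steps, which is worth confirming with a small test job. It sets the soft limit and then reads it back.

```bash
#!/bin/bash
# Set the soft core-file size limit to 0 for this shell and its children.
ulimit -S -c 0

# Read the limit back; 0 means no core files will be written.
core_limit="$(ulimit -S -c)"
if [ "$core_limit" != "0" ]; then
    echo "warning: core dumps still enabled (limit: $core_limit)" >&2
fi
```
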
### Improving job startup times

In the first round of GB runs, slow job startup was a common cause of crashes at launch.

Together with HPE, we identified the most likely cause as filesystem contention when dynamic libraries are loaded before `main()` starts.

The fix is to change how the SquashFS file for the uenv or container used by your job is stored on the filesystem, by setting Lustre striping on that file.

```console title="set lustre striping on uenv squashfs file"
$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
$ lfs migrate --stripe-count 20 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
```

If you are using a [SquashFS image for your Python environment][ref-guides-storage-venv],
you should also set the striping for that file.

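When several image files need the same treatment, the `lfs migrate` call above can be wrapped in a loop. The sketch below is a convenience under stated assumptions: `stripe_images` is a hypothetical helper, the `DRY_RUN` switch is invented here, and the path in the usage line is a placeholder. By default it only prints the commands so that they can be reviewed first.

```bash
#!/bin/bash
# Hypothetical helper: apply the striping settings from this guide to each
# file given as an argument. With DRY_RUN=1 (the default here) the commands
# are printed instead of executed.
stripe_images() {
    local file
    for file in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "lfs migrate --stripe-count 20 --stripe-size 1M $file"
        else
            lfs migrate --stripe-count 20 --stripe-size 1M "$file"
        fi
    done
}

# Placeholder path for a Python environment image; replace with your own.
stripe_images /path/to/my-venv.squashfs
```

Set `DRY_RUN=0` to perform the migration; `lfs getstripe <file>` then shows the resulting layout.
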
As an additional precaution, we recommend increasing the wait threshold for `MPI_Init` from the default 180 seconds to 300 seconds.

```console title="increase MPI initialization time-out"
$ export PMI_MMAP_SYNC_WAIT_TIME=300
```
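If a job script or login environment may already set this variable, a guard avoids lowering a larger value by accident. A minimal sketch, assuming the variable holds a plain integer number of seconds when set; the `min_wait` and `current` names are illustrative:

```bash
#!/bin/bash
# Raise PMI_MMAP_SYNC_WAIT_TIME to at least 300 seconds, keeping any larger
# value that is already present in the environment.
min_wait=300
current="${PMI_MMAP_SYNC_WAIT_TIME:-0}"
if [ "$current" -lt "$min_wait" ]; then
    export PMI_MMAP_SYNC_WAIT_TIME="$min_wait"
fi
echo "PMI_MMAP_SYNC_WAIT_TIME=$PMI_MMAP_SYNC_WAIT_TIME"
```
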

docs/services/cicd.md

Lines changed: 1 addition & 1 deletion
@@ -763,7 +763,7 @@ value: `jfrog.svc.cscs.ch/docker-ci-ext/<repository-id>`

 The prefix path in the CSCS internal container image registry, to which your pipeline has write access.
 Within this prefix, you can choose any directory structure.
-Images that are pushed to a path matching **/public/** , can be pulled by anybody within CSCS network
+Images that are pushed to a path matching `**/public/**`, can be pulled by anybody within CSCS network

 ### `CSCS_CI_MW_URL`
 value: `https://cicd-ext-mw.cscs.ch/ci`

docs/software/container-engine/edf.md

Lines changed: 3 additions & 3 deletions
@@ -168,15 +168,15 @@ Environment variables to set in the container. Empty string values will unset th
 * Basic `env` block
 ```toml
 [env]
-MY_RUN = "production",
+MY_RUN = "production"
 DEBUG = "false"
 ```

 * Use of environment variable expansion
 ```toml
 [env]
-MY_NODE = "${VAR_FROM_HOST}",
-PATH = "${PATH}:/custom/bin",
+MY_NODE = "${VAR_FROM_HOST}"
+PATH = "${PATH}:/custom/bin"
 DEBUG = "true"
 ```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -130,6 +130,7 @@ nav:
 - 'Internet Access on Alps': guides/internet-access.md
 - 'Storage': guides/storage.md
 - 'Using the terminal': guides/terminal.md
+- 'Gordon Bell 2025': guides/gb25.md
 - 'Policies':
 - policies/index.md
 - 'User Regulations': policies/regulations.md
