
Commit 01ecb62

Updates to ML docs from PASC HiRAD collaborators (#211)
1 parent 43f6a6f commit 01ecb62

File tree

- docs/guides/mlp_tutorials/llm-nanotron-training.md
- docs/software/container-engine/run.md
- docs/storage/filesystems.md

3 files changed: +15 −1 lines changed


docs/guides/mlp_tutorials/llm-nanotron-training.md

Lines changed: 11 additions & 0 deletions
@@ -381,6 +381,8 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
 "
 ```
 
+Note that the quoted block inside the `srun` command gets executed by each task separately, i.e. 4 times per node, but all tasks on a node share the _same_ container. This is different from the setup with `torchrun`, where only one task executes the lines above the final `python` command. Be aware of this to avoid accidental race conditions (e.g. from writing to the container filesystem in one of these lines).
+
 
 ## Launch a Training Job
 
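To illustrate the race-condition caveat in the added note, here is a minimal sketch of how one-time setup can be guarded so that only one task per node writes to the shared container filesystem. `SLURM_LOCALID` is a standard Slurm environment variable; the setup command, marker path, and training entry point are hypothetical placeholders, not part of the tutorial.

```bash
# Sketch: restrict one-time setup to the first task on each node, since
# all tasks on a node share the same container filesystem.
if [ "${SLURM_LOCALID}" -eq 0 ]; then
    pip install --no-deps -e .    # hypothetical setup writing to the container
    touch /tmp/.setup-done        # marker visible to all tasks on this node
else
    # Other tasks wait until the setup marker appears before proceeding.
    while [ ! -f /tmp/.setup-done ]; do sleep 1; done
fi
python run_train.py "$@"          # hypothetical training entry point
```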
@@ -397,3 +399,12 @@ You can inspect if your job has been submitted successfully by running `squeue -
 ```
 
 In the end, the checkpoints of the model will be saved in `checkpoints/`.
+
+!!! note "Core dump files"
+    If the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the cause of a specific crash (e.g. by loading them with `cuda-gdb` and inspecting the stack trace with `bt`), they can accumulate over time and occupy significant space on the filesystem. For this reason, it can be useful to disable their creation by adding the line
+
+    ```bash
+    ulimit -c 0
+    ```
+
+    to the sbatch script.
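For placement, a minimal sketch of an sbatch script with this line added near the top; the `#SBATCH` options and the training command are hypothetical, while the environment file name follows the tutorial above.

```bash
#!/bin/bash
#SBATCH --job-name=nanotron-training   # hypothetical job name
#SBATCH --nodes=2                      # hypothetical node count

# Disable core dump creation for this job and everything it launches.
ulimit -c 0

srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
    python run_train.py                # hypothetical training command
"
```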

docs/software/container-engine/run.md

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,9 @@ There are three ways to do so:
 $ srun --environment=ubuntu echo "Hello"
 ```
 
+!!! note "Shared container at the node level"
+    For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
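To make the node-level sharing concrete, a small sketch along the lines of the `srun` example above; the marker path is hypothetical, and the two-second wait is a crude stand-in for proper synchronization.

```bash
# One task writes into the shared container filesystem; another task on
# the same node can read the file back afterwards.
srun --ntasks-per-node=2 --environment=ubuntu bash -c '
    if [ "${SLURM_LOCALID}" -eq 0 ]; then
        echo "written by task 0" > /container-marker.txt   # hypothetical path
    fi
    sleep 2   # crude wait so the write lands before the read (sketch only)
    cat /container-marker.txt
'
```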

docs/storage/filesystems.md

Lines changed: 1 addition & 1 deletion
@@ -246,7 +246,7 @@ If you are in multiple projects, information for the [Store][ref-storage-store]
 ```
 
 
-Here the user is in two projects, namely `g33` and `csstaff`, for which the quota for their respective paths in `/capstor/store` are reported.
+Here the user is in two projects, namely `g33` and `csstaff`, for which the quotas for their respective paths in `/capstor/store` are reported. Note that the path `/vast/users/cscs/user` is mounted at `/users/user` (i.e. `$HOME`) on Alps.
 
 [](){#ref-storage-backup}
 ## Backup
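As a quick way to see the mount relationship mentioned in the added sentence, one could query the standard `findmnt` utility (the exact source path reported depends on the user):

```bash
# Show which filesystem source backs the home directory; per the note
# above, /users/<user> is expected to be backed by /vast/users/cscs/<user>.
findmnt --target "$HOME"
```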
