From 8e47d9d57e2e3cf88d79f90d76deb7630cd87528 Mon Sep 17 00:00:00 2001
From: Lukas Drescher
Date: Tue, 5 Aug 2025 14:26:46 +0200
Subject: [PATCH] Updates from HiRAD collaborators

---
 docs/guides/mlp_tutorials/llm-nanotron-training.md | 11 +++++++++++
 docs/software/container-engine/run.md              |  3 +++
 docs/storage/filesystems.md                        |  2 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md
index 5072b912..10194a20 100644
--- a/docs/guides/mlp_tutorials/llm-nanotron-training.md
+++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md
@@ -381,6 +381,8 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
 "
 ```
 
+ Note that the quoted block inside the `srun` command is executed by each task separately, i.e. 4 times per node, but all tasks on a node share the _same_ container. This differs from the setup with `torchrun`, where only one task executes the lines above the final `python` command. Keep this in mind to avoid accidental race conditions (e.g. by writing to the container filesystem in one of these lines).
+
 
 ## Launch a Training Job
 
@@ -397,3 +399,12 @@ You can inspect if your job has been submitted successfully by running `squeue -
 ```
 
 In the end, the checkpoints of the model will be saved in `checkpoints/`.
+
+!!! note "Core dump files"
+    If the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the cause of a specific crash (e.g. by loading them with `cuda-gdb` and inspecting the stack trace with `bt`), they may accumulate over time and occupy a lot of space on the filesystem. For this reason, it can be useful to disable their creation by adding the line
+
+    ```bash
+    ulimit -c 0
+    ```
+
+    to the sbatch script.
diff --git a/docs/software/container-engine/run.md b/docs/software/container-engine/run.md
index 8f0cb48e..6d399d91 100644
--- a/docs/software/container-engine/run.md
+++ b/docs/software/container-engine/run.md
@@ -21,6 +21,9 @@ There are three ways to do so:
 $ srun --environment=ubuntu echo "Hello"
 ```
 
+!!! note "Shared container at the node-level"
+    For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
diff --git a/docs/storage/filesystems.md b/docs/storage/filesystems.md
index 88358157..b86857e4 100644
--- a/docs/storage/filesystems.md
+++ b/docs/storage/filesystems.md
@@ -246,7 +246,7 @@ If you are in multiple projects, information for the [Store][ref-storage-store]
 
     ```
 
-    Here the user is in two projects, namely `g33` and `csstaff`, for which the quota for their respective paths in `/capstor/store` are reported.
+    Here the user is in two projects, namely `g33` and `csstaff`, for which the quota for their respective paths in `/capstor/store` are reported. Note that the path `/vast/users/cscs/user` is mounted at `/users/user` (i.e. `$HOME`) on Alps.
 
 [](){#ref-storage-backup}
 ## Backup