Merged
11 changes: 11 additions & 0 deletions docs/guides/mlp_tutorials/llm-nanotron-training.md
@@ -381,6 +381,8 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
"
```

Note that the quoted block inside the `srun` command is executed by each task separately, i.e. 4 times per node, but all tasks on a node share the _same_ container. This differs from the setup with `torchrun`, where only one task executes the lines above the final `python` command. Keep this in mind to avoid accidental race conditions (e.g. by writing to the container filesystem in one of these lines).
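
One way to avoid such a race is to let only one task per node perform the write. The following is a minimal sketch, not part of the tutorial's script: it assumes the Slurm-provided variable `SLURM_LOCALID` (the node-local task index), and the helper name `run_once_per_node` and the directory `once_per_node_demo` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: run a setup step once per node inside a block that every task executes.
# SLURM_LOCALID is set by Slurm to the node-local task index (0, 1, 2, ...).
# The fallback to 0 only makes the snippet runnable outside of Slurm.

run_once_per_node() {
    # Execute the given command only on the task with node-local index 0.
    if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
        "$@"
    fi
}

# Hypothetical setup step that writes to the filesystem.
marker_dir="${TMPDIR:-/tmp}/once_per_node_demo"
run_once_per_node mkdir -p "$marker_dir"
```

This guard does not make the other tasks wait for the step to finish; if they depend on its result, some additional synchronization is still needed.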


## Launch a Training Job

@@ -397,3 +399,12 @@ You can inspect if your job has been submitted successfully by running `squeue -
```

In the end, the checkpoints of the model will be saved in `checkpoints/`.

!!! note "Core dump files"
If the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the cause of a specific crash (e.g. by loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large amount of space on the filesystem. For this reason, it can be useful to disable their creation by adding the line

```bash
ulimit -c 0
```

to the sbatch script.
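
As a rough sketch of where this line could go, the fragment below places it near the top of a batch script, before any `srun` command. The `#SBATCH` values are placeholders rather than the tutorial's actual settings, and the sketch assumes that the tasks launched afterwards inherit the limit from the batch shell.

```bash
#!/bin/bash
#SBATCH --job-name=core-dump-demo   # hypothetical job name
#SBATCH --nodes=1                   # placeholder value

# Disable core dump files for this shell and the processes started from it.
ulimit -c 0

# Print the resulting limit; 0 means that no core files are written.
core_limit="$(ulimit -c)"
echo "core file size limit: ${core_limit}"

# The srun command that launches the training would follow here.
```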
3 changes: 3 additions & 0 deletions docs/software/container-engine/run.md
@@ -21,6 +21,9 @@ There are three ways to do so:
$ srun --environment=ubuntu echo "Hello"
```

!!! note "Shared container at the node-level"
For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
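
If each task does need to write a file of its own, giving every task a distinct path keeps the writes from colliding. A minimal sketch, assuming the Slurm-provided variable `SLURM_PROCID` (the global task rank); the file name `demo_task_<rank>.txt` is hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: give each task its own output file so that tasks sharing a
# container filesystem do not overwrite each other's data.
# SLURM_PROCID is set by Slurm to the global task rank.
# The fallback to 0 only makes the snippet runnable outside of Slurm.
task_rank="${SLURM_PROCID:-0}"

# Hypothetical per-task file location.
scratch_file="${TMPDIR:-/tmp}/demo_task_${task_rank}.txt"
echo "written by task ${task_rank}" > "$scratch_file"
```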

### Use from batch scripts

Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
2 changes: 1 addition & 1 deletion docs/storage/filesystems.md
@@ -246,7 +246,7 @@ If you are in multiple projects, information for the [Store][ref-storage-store]
```


Here the user is in two projects, namely `g33` and `csstaff`, for which the quota for their respective paths in `/capstor/store` are reported.
Here the user is in two projects, namely `g33` and `csstaff`, for which the quota for their respective paths in `/capstor/store` are reported. Note that the path `/vast/users/cscs/user` is mounted at `/users/user` (i.e. `$HOME`) on Alps.

[](){#ref-storage-backup}
## Backup