
Commit c5383db

small cleanup

Parent: bedceb6

3 files changed: 15 additions, 14 deletions

docs/access/jupyterlab.md

Lines changed: 3 additions & 1 deletion
@@ -199,7 +199,9 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/
 
 While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.
 
-A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html), as demonstrated in the [tutorials][ref-software-ml-tutorials].
+In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be carried over directly to a Jupyter cell with a `%%bash` header (so that its contents are interpreted by bash).
+For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell:
 
 ```bash
 !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ...
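
For illustration, an `accelerate` cell of this kind might look like the following sketch; the script name, config file and flag values are hypothetical placeholders, not taken from the tutorial:

```bash
%%bash
# The %%bash header makes Jupyter run this cell's contents under bash.
# Launch one process per GPU on a single GH200 node (4 GPUs); the
# script and its arguments below are placeholders.
accelerate launch --num_processes=4 finetune.py --config config.yaml
```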

docs/build-install/containers.md

Lines changed: 5 additions & 10 deletions
@@ -17,7 +17,8 @@ graphroot = "/dev/shm/$USER/root"
 ```
 
 !!! warning
-    If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. See also [this guide][ref-guides-terminal-arch] for further information about XDG variables.
+    If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead.
+    See the [terminal user guide][ref-guides-terminal-arch] for further information about XDG variables.
 
 !!! warning
     In the above configuration, `/dev/shm` is used to store the container images.
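
As a side note, the `storage.conf` being edited here might be created as in the following sketch; only the `graphroot` line is confirmed by the hunk header above, while the `[storage]` table and `overlay` driver are assumptions:

```bash
# Sketch: create the rootless Podman storage configuration.
# The quoted 'EOF' keeps $USER literal, as it appears in the docs snippet.
mkdir -p "$HOME/.config/containers"
cat > "$HOME/.config/containers/storage.conf" <<'EOF'
[storage]
  driver = "overlay"
  graphroot = "/dev/shm/$USER/root"
EOF
```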
@@ -64,15 +65,9 @@ In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-bu
 An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution.
 It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`.
 
-!!! warning "Preliminary configuration: Lustre settings for container images"
-    Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.
-
-    ```bash
-    lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)!
-    ```
-
-    1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB)
-
+!!! info "Preliminary configuration: Lustre settings for container images"
+    Container images are stored in a single [SquashFS]() file that is typically between 1 and 20 GB in size (particularly for large ML containers).
+    To ensure good performance for jobs on multiple nodes, take the time to configure the target directory with `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or use `lfs migrate` to fix files that have already been imported.
 
 To import the image:
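
The concrete striping recipe removed from the old warning is still what the linked Lustre best practices boil down to. A sketch based on the removed lines, with an `lfs migrate` variant for already-imported files (the migrate form is an assumption; it accepts the same layout options as `lfs setstripe`):

```bash
# Progressive layout from the removed hunk: files up to 4 MB stay on one
# storage node, 4-64 MB use four nodes, larger files stripe across all.
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory>

# Apply the same layout to an image imported before the directory was
# configured, by rewriting the file in place.
lfs migrate -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image file>
```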

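The import command itself is cut off at the end of this hunk. As an assumption about the surrounding docs (not shown in the diff), the Container Engine workflow typically converts a locally built Podman image into a SquashFS file with `enroot import`:

```bash
# Hedged sketch: export the Podman-built image to a SquashFS archive in
# the Lustre directory configured above. The image name is a placeholder.
enroot import -o "<path to image directory>/my-image.sqsh" podman://my-image:latest
```
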
docs/platforms/mlp/index.md

Lines changed: 7 additions & 3 deletions
@@ -3,6 +3,13 @@
 
 The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).
 
+<div class="grid cards" markdown>
+- :fontawesome-solid-mountain: [__Tutorials__][ref-software-ml-tutorials]
+
+    Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
+
+</div>
+
 ## Getting started
 
 ### Getting access
@@ -89,6 +96,3 @@ Project is per project - each project gets a project folder with project-specifi
 * hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela
 * it is not recommended to write directly to the project path from jobs.
 
-## Guides and tutorials
-
-Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
