Move MLP tutorials under software, add CE section to Pytorch including best practice for large-scale training #231
Changes from 17 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| JAX | ||
| nvitop | ||
| NVRTC | ||
| placeholders |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,17 +4,21 @@ | |
| Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html). | ||
| Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools. | ||
|
|
||
| [](){#ref-build-containers-configure-podman} | ||
| ## Preliminary step: configuring Podman's storage | ||
|
|
||
| The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template: | ||
| The first step to use Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template: | ||
|
|
||
| ```toml | ||
| ```toml title="$HOME/.config/containers/storage.conf" | ||
| [storage] | ||
| driver = "overlay" | ||
| runroot = "/dev/shm/$USER/runroot" | ||
| graphroot = "/dev/shm/$USER/root" | ||
| ``` | ||
|
|
||
| !!! warning | ||
| If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. See also [this guide][ref-guides-terminal-arch] for further information about XDG variables. | ||
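The placement rule above (use `$XDG_CONFIG_HOME` when it is set, otherwise `$HOME/.config`) can be sketched in a few lines of shell. This is an illustrative sketch and not part of the PR: the variable name `conf_dir` is hypothetical, and the snippet overwrites any existing `storage.conf` at that location.

```bash
# Resolve the config directory: $XDG_CONFIG_HOME if set and non-empty, else $HOME/.config
conf_dir="${XDG_CONFIG_HOME:-$HOME/.config}/containers"
mkdir -p "$conf_dir"

# The quoted heredoc delimiter ('EOF') disables shell expansion, so $USER is
# written to the file verbatim, exactly as it appears in the template.
cat > "$conf_dir/storage.conf" <<'EOF'
[storage]
driver = "overlay"
runroot = "/dev/shm/$USER/runroot"
graphroot = "/dev/shm/$USER/root"
EOF

echo "Wrote $conf_dir/storage.conf"
```

Back up an existing configuration file before running this, since `cat >` replaces it.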
|
|
||
| !!! warning | ||
| In the above configuration, `/dev/shm` is used to store the container images. | ||
| `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman. | ||
|
|
@@ -43,11 +47,33 @@ podman build -t <image:tag> . | |
|
|
||
| In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention. | ||
|
|
||
| !!! info "Debugging the container build" | ||
| If the container build fails, you can run an interactive shell using the image from the last successfully built layer with | ||
|
|
||
| ```bash | ||
| podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void <last-layer-hash> bash # (1)! | ||
| ``` | ||
|
|
||
| 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with Podman. | ||
|
|
||
| replacing `<last-layer-hash>` with the actual hash printed by the build job, and then interactively test the failing command. | ||
|
|
||
|
|
||
| ## Importing images in the Container Engine | ||
|
|
||
| An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution. | ||
| It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`. | ||
|
|
||
| !!! warning "Preliminary configuration: Lustre settings for container images" | ||
| Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes. | ||
|
|
||
| ```bash | ||
| lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)! | ||
|
||
| ``` | ||
|
|
||
| 1. This ensures that files subsequently stored in the directory are kept on a single storage node (up to 4 MB), spread over 4 storage nodes (between 4 and 64 MB), or striped across all storage nodes (above 64 MB). | ||
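As a hedged sketch of the preliminary step above, the striping command can be wrapped so that the target directory is created first and the resulting default layout is printed for inspection. This is not part of the PR: `IMAGE_DIR` and `image_dir` are hypothetical names, and the default path below is only a placeholder for your actual image directory on the Lustre filesystem.

```bash
# Hypothetical target directory; set IMAGE_DIR to your real image directory on Lustre
image_dir="${IMAGE_DIR:-$PWD/container-images}"
mkdir -p "$image_dir"

if command -v lfs >/dev/null 2>&1; then
    # Same layout as in the command above, applied to the directory
    if lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M "$image_dir"; then
        # Show the default layout that new files in the directory will inherit
        lfs getstripe -d "$image_dir"
    else
        echo "lfs setstripe failed; is $image_dir on a Lustre filesystem?" >&2
    fi
else
    echo "lfs not found; skipping stripe configuration" >&2
fi
```

The layout only affects files created after it is set, so run this before importing the image.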
|
|
||
|
|
||
| To import the image: | ||
|
|
||
| ``` | ||
|
|
@@ -62,7 +88,6 @@ image = "/<path to image directory>/<image_name.sqsh>" | |
| mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"] | ||
| workdir = "/capstor/scratch/cscs/<username>" | ||
| ``` | ||
|
|
||
| ## Pushing Images to a Container Registry | ||
|
|
||
| In order to push an image to a container registry, you first need to follow three steps: | ||
|
|
||
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi | |
|
|
||
| ## Guides and tutorials | ||
|
|
||
| Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page. | ||
| Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials]. | ||
|
||