Move MLP tutorials under software, add CE section to Pytorch including best practice for large-scale training #231
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,17 +4,21 @@ | |
| Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html). | ||
| Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools. | ||
|
|
||
| [](){#ref-build-containers-configure-podman} | ||
| ## Preliminary step: configuring Podman's storage | ||
|
|
||
| The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template: | ||
| The first step to use Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template: | ||
|
|
||
| ```toml | ||
| ```toml title="$HOME/.config/containers/storage.conf" | ||
| [storage] | ||
| driver = "overlay" | ||
| runroot = "/dev/shm/$USER/runroot" | ||
| graphroot = "/dev/shm/$USER/root" | ||
| ``` | ||
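A minimal sketch of creating this file from the shell, assuming the template above is written out verbatim. The variable names `config_dir` and `storage_conf` are illustrative only and not part of the documentation under review; the fallback to `$HOME/.config` mirrors the `$XDG_CONFIG_HOME` note in this diff.

```bash
#!/usr/bin/env bash
set -e

# Use $XDG_CONFIG_HOME if it is set, otherwise fall back to $HOME/.config.
config_dir="${XDG_CONFIG_HOME:-$HOME/.config}/containers"
storage_conf="$config_dir/storage.conf"

mkdir -p "$config_dir"

# Do not overwrite an existing configuration.
if [ ! -e "$storage_conf" ]; then
    # The quoted 'EOF' keeps $USER literal, exactly as in the template.
    cat > "$storage_conf" <<'EOF'
[storage]
driver = "overlay"
runroot = "/dev/shm/$USER/runroot"
graphroot = "/dev/shm/$USER/root"
EOF
fi

echo "Podman storage configuration: $storage_conf"
```

The guard against overwriting is a design choice of this sketch: an existing `storage.conf` may hold site-specific settings that should be reviewed by hand rather than replaced.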
|
|
||
| !!! warning | ||
| If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. | ||
lukasgd marked this conversation as resolved.
|
||
|
|
||
| !!! warning | ||
| In the above configuration, `/dev/shm` is used to store the container images. | ||
| `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman. | ||
|
|
@@ -43,11 +47,33 @@ podman build -t <image:tag> . | |
|
|
||
| In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention. | ||
|
|
||
| !!! info "Debugging the container build" | ||
| If the container build fails, you can run an interactive shell using the image from the last successfully built layer with | ||
|
|
||
| ```bash | ||
| podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void <last-layer-hash> bash # (1)! | ||
| ``` | ||
|
|
||
| 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with Podman. | ||
|
|
||
| replacing `<last-layer-hash>` with the actual hash printed by the build job, and interactively test the failing command. | ||
lukasgd marked this conversation as resolved.
|
||
|
|
||
|
|
||
| ## Importing images in the Container Engine | ||
|
|
||
| An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution. | ||
| It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`. | ||
|
|
||
| !!! warning "Preliminary configuration: Lustre settings for container images" | ||
| Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes. | ||
|
|
||
| ```bash | ||
| lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)! | ||
|
||
| ``` | ||
|
|
||
| 1. This ensures that files stored subsequently end up on a single storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB), or are striped across all storage nodes (above 64 MB). | ||
|
|
||
|
|
||
| To import the image: | ||
|
|
||
| ``` | ||
|
|
@@ -62,7 +88,6 @@ image = "/<path to image directory>/<image_name.sqsh>" | |
| mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"] | ||
| workdir = "/capstor/scratch/cscs/<username>" | ||
| ``` | ||
|
|
||
| ## Pushing Images to a Container Registry | ||
|
|
||
| In order to push an image to a container registry, you first need to follow three steps: | ||
|
|
||
This file was deleted.