This guide assumes you have already logged in to Alps successfully and have been added to the a-infra02 project.
- SSH into Clariden and go to the '/capstor/scratch/cscs/$USER' directory.
- Get a compute node:
    srun -t 5:00:00 -A a-infra02 --container-writable --pty bash
- Clone the sailor repo:
    git clone https://github.com/eth-easl/sailor.git && cd sailor && git checkout sosp25_ae
- While in the folder with the Dockerfile, create a new image (adjust the name as you want):
    podman build -t test:v1 .
- You can now see your image using this command:
    podman images
- Use enroot to export the image into a squash file:
    enroot import -o test.sqsh podman://test:v1
- Make it readable:
    setfacl -b test.sqsh && chmod 755 test.sqsh
- Log out of the node.
- Replace 'user' with your username in ae_scripts/clariden_scripts/sailor.toml.
- Start a container from the image and get a shell in it:
    srun -t 5:00:00 -A a-infra02 --container-writable --environment=/capstor/scratch/cscs/$USER/sailor/ae_scripts/clariden_scripts/sailor.toml --pty bash
- Run a simple training job with just 1 GPU to check that everything works:
    cd /root/sailor/third_party/Megatron-DeepSpeed/
    export SAILOR_LOGS_DIR=logs
    bash run.sh 1 0 127.0.0.1 1234 1 1 1 1 1
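
If you repeat the image steps often, the build, export, permission, and sailor.toml steps above could be wrapped in two small shell functions. This is only a sketch: the function names build_sqsh and set_toml_user are made up for illustration, and the sed pattern assumes the placeholder appears in sailor.toml as a path component of the form ".../cscs/user/...", which this guide does not confirm. Check the file by hand if your copy looks different.

```shell
#!/usr/bin/env bash
# Sketch only: wraps steps from this guide in two hypothetical helper functions.

# Build the image, export it to a squash file, and make it readable.
# Usage: build_sqsh [image_tag] [output_file]
build_sqsh() {
  local image_tag="${1:-test:v1}"
  local out="${2:-test.sqsh}"

  # Fail early if we are not in the folder that holds the Dockerfile.
  if [ ! -f Dockerfile ]; then
    echo "build_sqsh: no Dockerfile in $(pwd)" >&2
    return 1
  fi

  # Same commands as in the steps above, stopping at the first failure.
  podman build -t "$image_tag" . || return 1
  enroot import -o "$out" "podman://$image_tag" || return 1
  setfacl -b "$out" && chmod 755 "$out"
}

# Replace the placeholder path component "user" with a real username.
# ASSUMPTION: the placeholder shows up as ".../cscs/user/..." in the file.
# Usage: set_toml_user <toml_file> [username]
set_toml_user() {
  local toml_file="$1"
  local user_name="${2:-$USER}"
  local tmp rc

  tmp="$(mktemp)" || return 1
  # Write to a temporary file first, then copy back so that the original
  # file keeps its permissions and is untouched if sed fails.
  sed "s#/cscs/user/#/cscs/${user_name}/#g" "$toml_file" > "$tmp" && cat "$tmp" > "$toml_file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

After sourcing these definitions, one possible use is to run build_sqsh test:v1 test.sqsh on the compute node from inside the sailor folder, and then, after logging out of the node, run set_toml_user ae_scripts/clariden_scripts/sailor.toml from the same folder. Open the file afterwards to confirm that the paths point to your own scratch directory.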