diff --git a/README.md b/README.md
index 30e3ffae3..7dcef9954 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,37 @@ srun torchrun --nnodes 2
 If your gpu count per node is not 8, adjust `--nproc_per_node` in the torchrun command and `#SBATCH --gpus-per-task` in the SBATCH command section.
 
+### Multi-Node Training (SkyPilot)
+You can run the same multi-node job on the cloud with [SkyPilot](https://skypilot.readthedocs.io/) using the provided `multinode-trainer.sky.yaml`, which mirrors `multinode_trainer.slurm` (it sets the node count and per-node GPU count, exports the environment, and launches `torchrun`).
+
+Run:
+
+```bash
+# For setup steps, see:
+# https://docs.skypilot.co/en/latest/getting-started/installation.html
+pip install "skypilot[kubernetes,aws]"  # or your cloud: [gcp], [azure], etc.
+
+# Launch a cluster and start training
+export HF_TOKEN=...  # if using a gated model from the HF Hub
+sky launch -c torchtitan-multinode multinode-trainer.sky.yaml --env HF_TOKEN
+
+# Tail logs
+sky logs torchtitan-multinode
+
+# Stop the cluster when done
+sky down torchtitan-multinode
+```
+
+Overrides (optional):
+
+```bash
+# Increase the number of nodes and use a different config file without editing multinode-trainer.sky.yaml
+sky launch -c torchtitan-multinode multinode-trainer.sky.yaml \
+  --num-nodes 4 \
+  --env HF_TOKEN \
+  --env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
+```
+
 ## Citation
 
 We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques.
 
diff --git a/multinode-trainer.sky.yaml b/multinode-trainer.sky.yaml
new file mode 100644
index 000000000..887f290c8
--- /dev/null
+++ b/multinode-trainer.sky.yaml
@@ -0,0 +1,54 @@
+# SkyPilot configuration for TorchTitan multi-node training
+# This configuration reproduces the functionality of multinode_trainer.slurm
+#
+# To launch:
+#   sky launch -c torchtitan-multinode multinode-trainer.sky.yaml
+#
+# To stop:
+#   sky down torchtitan-multinode
+#
+# To monitor:
+#   sky status --refresh
+
+name: torchtitan-multinode
+
+resources:
+  accelerators: {H100:8, H200:8}  # a set of candidates: 8x H100 or 8x H200, whichever is available
+  disk_size: 1024  # GB
+
+num_nodes: 2
+
+workdir: .
+
+envs:
+  CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
+  HF_TOKEN: ""
+
+setup: |
+  pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
+  pip install -r requirements.txt
+  python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN
+
+run: |
+  # Get the head node IP (first entry in SKYPILOT_NODE_IPS)
+  HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  echo "Head node IP: $HEAD_NODE_IP"
+
+  # SKYPILOT_NODE_RANK is set automatically by SkyPilot
+  torchrun \
+    --nnodes $SKYPILOT_NUM_NODES \
+    --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
+    --node_rank $SKYPILOT_NODE_RANK \
+    --master_addr=$HEAD_NODE_IP \
+    --master_port=8008 \
+    -m torchtitan.train \
+    --job.config_file $CONFIG_FILE \
+    --training.dataset c4_test
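+
+# Notes:
+# - SKYPILOT_NODE_IPS, SKYPILOT_NUM_NODES, SKYPILOT_NUM_GPUS_PER_NODE, and
+#   SKYPILOT_NODE_RANK are all injected by SkyPilot on every node.
+# - --master_port 8008 is an arbitrary choice; any port that is free on the
+#   head node and reachable from the worker nodes will do.
+# - --training.dataset c4_test overrides the dataset in CONFIG_FILE with a
+#   small test dataset; drop the flag to train on the dataset from the config.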