31 changes: 31 additions & 0 deletions README.md
@@ -160,6 +160,37 @@ srun torchrun --nnodes 2
If your GPU count per node is not 8, adjust `--nproc_per_node` in the `torchrun` command and `#SBATCH --gpus-per-task` in the SBATCH directives.
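For example, on a hypothetical cluster with 4 GPUs per node, the two settings would change together (a sketch; all other flags stay as in `multinode_trainer.slurm`):

```bash
# Sketch: matching SBATCH and torchrun settings for 4 GPUs per node
#SBATCH --gpus-per-task=4

srun torchrun --nnodes 2 --nproc_per_node 4 ...
```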


### Multi-Node Training (SkyPilot)
You can run the same multi-node job on the cloud with [SkyPilot](https://skypilot.readthedocs.io/) using the provided `multinode-trainer.sky.yaml`, which mirrors `multinode_trainer.slurm` (sets node count, per-node GPUs, environment, and launches `torchrun`).
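Inside the task, SkyPilot exposes the node list via `SKYPILOT_NODE_IPS` (newline-separated) and the YAML's `run` section takes the first entry as the `torchrun` rendezvous head. A minimal Python sketch of that lookup, with placeholder IPs for when it runs outside a SkyPilot task:

```python
import os

# SKYPILOT_NODE_IPS is a newline-separated list of node IPs; the first
# entry becomes torchrun's --master_addr (the head node).
# The default here is a placeholder for illustration only.
node_ips = os.environ.get("SKYPILOT_NODE_IPS", "10.0.0.1\n10.0.0.2")
head_node_ip = node_ips.splitlines()[0]
print(head_node_ip)
```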

Run:

```bash
# For setup steps, see:
# https://docs.skypilot.co/en/latest/getting-started/installation.html
pip install "skypilot[kubernetes,aws]" # or your cloud: [gcp], [azure], etc.

# Launch a cluster and start training
export HF_TOKEN=... # if using a gated model from the HF Hub
sky launch -c torchtitan-multinode multinode-trainer.sky.yaml --env HF_TOKEN

# Tail logs
sky logs torchtitan-multinode

# Stop the cluster when done
sky down torchtitan-multinode
```

Overrides (optional):

```bash
# Increase the number of nodes and use a different config file without editing the YAML
sky launch -c torchtitan-multinode multinode-trainer.sky.yaml \
--num-nodes 4 \
--env HF_TOKEN \
--env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
```

## Citation

We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques.
45 changes: 45 additions & 0 deletions multinode-trainer.sky.yaml
@@ -0,0 +1,45 @@
# SkyPilot configuration for TorchTitan multi-node training
# This configuration reproduces the functionality of multinode_trainer.slurm
#
# To launch:
#   sky launch -c torchtitan-multinode multinode-trainer.sky.yaml
#
# To stop:
#   sky down torchtitan-multinode
#
# To monitor:
# sky status --refresh

name: torchtitan-multinode

resources:
accelerators: {H100:8, H200:8}
disk_size: 1024GB

num_nodes: 2

workdir: .

envs:
CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
HF_TOKEN: ""

setup: |
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
pip install -r requirements.txt
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN
run: |
# Get head node IP (first node in the list)
HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Head node IP: $HEAD_NODE_IP"

# SKYPILOT_NODE_RANK is automatically set by SkyPilot
torchrun \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank $SKYPILOT_NODE_RANK \
--master_addr=$HEAD_NODE_IP \
--master_port=8008 \
-m torchtitan.train \
--job.config_file $CONFIG_FILE \
--training.dataset c4_test