31 changes: 31 additions & 0 deletions README.md
@@ -160,6 +160,37 @@ srun torchrun --nnodes 2
If your GPU count per node is not 8, adjust `--nproc_per_node` in the `torchrun` command and `#SBATCH --gpus-per-task` in the SBATCH directives.
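For example, on a hypothetical cluster with 4 GPUs per node, the two settings would change together (a sketch; all other flags stay as in `multinode_trainer.slurm`):

```bash
# Sketch: matching SBATCH and torchrun settings for 4 GPUs per node
#SBATCH --gpus-per-task=4

srun torchrun --nnodes 2 --nproc_per_node 4 ...
```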


### Multi-Node Training (SkyPilot)
You can run the same multi-node job on the cloud with [SkyPilot](https://skypilot.readthedocs.io/) using the provided `multinode-trainer.sky.yaml`, which mirrors `multinode_trainer.slurm` (sets node count, per-node GPUs, environment, and launches `torchrun`).
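Inside the task, SkyPilot exposes the node list via `SKYPILOT_NODE_IPS` (newline-separated) and the YAML's `run` section takes the first entry as the `torchrun` rendezvous head. A minimal Python sketch of that lookup, with placeholder IPs for when it runs outside a SkyPilot task:

```python
import os

# SKYPILOT_NODE_IPS is a newline-separated list of node IPs; the first
# entry becomes torchrun's --master_addr (the head node).
# The default here is a placeholder for illustration only.
node_ips = os.environ.get("SKYPILOT_NODE_IPS", "10.0.0.1\n10.0.0.2")
head_node_ip = node_ips.splitlines()[0]
print(head_node_ip)
```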

Run:

```bash
# For setup steps, see:
# https://docs.skypilot.co/en/latest/getting-started/installation.html
pip install "skypilot[kubernetes,aws]" # or your cloud: [gcp], [azure], etc.

# Launch a cluster and start training
export HF_TOKEN=... # if using a gated model from the HF Hub
sky launch -c torchtitan-multinode multinode-trainer.sky.yaml --env HF_TOKEN

# Tail logs
sky logs torchtitan-multinode

# Stop the cluster when done
sky down torchtitan-multinode
```

Overrides (optional):

```bash
# Increase the number of nodes and use a different config file without editing the YAML
sky launch -c torchtitan-multinode multinode-trainer.sky.yaml \
--num-nodes 4 \
--env HF_TOKEN \
--env CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_70b.toml
```

## Citation

We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques.
45 changes: 45 additions & 0 deletions multinode-trainer.sky.yaml
@@ -0,0 +1,45 @@
# SkyPilot configuration for TorchTitan multi-node training
# This configuration reproduces the functionality of multinode_trainer.slurm
#
# To launch:
#   sky launch -c torchtitan-multinode multinode-trainer.sky.yaml
#
# To stop:
#   sky down torchtitan-multinode
#
# To monitor:
# sky status --refresh

name: torchtitan-multinode

resources:
accelerators: {H100:8, H200:8}
disk_size: 1024GB

num_nodes: 2

workdir: .

envs:
CONFIG_FILE: "./torchtitan/models/llama3/train_configs/llama3_8b.toml"
HF_TOKEN: ""

setup: |
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
pip install -r requirements.txt
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=$HF_TOKEN
run: |
# Get head node IP (first node in the list)
HEAD_NODE_IP=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Head node IP: $HEAD_NODE_IP"

# SKYPILOT_NODE_RANK is automatically set by SkyPilot
torchrun \
--nnodes $SKYPILOT_NUM_NODES \
--nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank $SKYPILOT_NODE_RANK \
--master_addr=$HEAD_NODE_IP \
--master_port=8008 \
-m torchtitan.train \
--job.config_file $CONFIG_FILE \
--training.dataset c4_test