Description
Feature Description & Motivation
The torchtitan test case (3.test_cases/pytorch/torchtitan/) was written around March 2025 against an early version of pytorch/torchtitan. Since then, torchtitan has had three releases (v0.1.0, v0.2.0, v0.2.1), restructured its directory layout, expanded from 1 model family to 6, and added numerous distributed training features. The test case is now broken out of the box because the config path it references no longer exists upstream.
What's broken
- **Config path no longer exists** — the sbatch script references `torchtitan/models/llama/train_configs/llama3_8b.toml`, but upstream renamed `models/llama/` to `models/llama3/` (the old path returns a 404). The test case fails immediately on a fresh clone (see the sbatch sketch after this list).
- **Float8 TOML keys are stale** — the README shows a `[float8]` section with `enable_float8_linear = true`, but upstream restructured this to `[quantize.linear.float8]` and removed the `enable_float8_linear` key. Following the README instructions will produce a config error (see the before/after sketch after this list).
- **CUDA version mismatch** — `LD_PRELOAD` points to `/usr/local/cuda-12.1/lib/libnccl.so`, but pip installs from `cu124` (CUDA 12.4), and the latest torchtitan v0.2.1 uses `cu126` (CUDA 12.6).
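For the two script-level P0 fixes, a minimal sketch of the corrected sbatch lines. The `CONFIG_FILE` and `TORCHTITAN_PATH` names and the exact toolkit path are illustrative assumptions; match them to whatever the script and the cluster AMI actually use:

```bash
# Hypothetical excerpt from 1.llama_3_8b_torchtitan.sh; names are illustrative.
# Fix 1: upstream renamed models/llama/ to models/llama3/.
CONFIG_FILE="${TORCHTITAN_PATH}/torchtitan/models/llama3/train_configs/llama3_8b.toml"

# Fix 2: LD_PRELOAD must match the CUDA toolkit the installed wheels target
# (cu124 -> /usr/local/cuda-12.4, cu126 -> /usr/local/cuda-12.6).
export LD_PRELOAD=/usr/local/cuda-12.4/lib/libnccl.so
```

And for the float8 rename, a before/after TOML sketch. The keys to set inside the new section are deliberately not shown here; they should be copied from upstream's sample configs:

```toml
# Stale form from the README; rejected by current torchtitan:
# [float8]
# enable_float8_linear = true

# Restructured section name per upstream (enable_float8_linear was removed;
# consult the release's sample configs for the keys this section takes).
[quantize.linear.float8]
```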
What's outdated
- **Unpinned versions everywhere** — `git clone` fetches torchtitan at HEAD with no tag/commit, `pip3 install --pre torch` pulls whatever nightly is current, and `pip install --pre torchao` likewise. Torchtitan now has releases (v0.1.0, v0.2.0, v0.2.1) that pin compatible torch + torchao versions, which should be used instead (see the sketch after this list).
- **Only Llama 3.1 8B** — upstream now provides training configs for 6 model families:

  | Model | Configs available | Notes |
  |---|---|---|
  | Llama 3.1 | 8B, 70B, 405B | Already partially covered |
  | Llama 4 (MoE) | 17Bx16E, 17Bx128E | Mixture-of-Experts with expert parallelism |
  | DeepSeek-V3 | 16B, 671B | MoE architecture |
  | Qwen 3 | 0.6B, 1.7B, 32B, MoE | Dense and MoE variants |
  | Flux | dev, schnell | Image generation (diffusion) |
  | GPT-OSS | debug | HF checkpoint loading |
- **Key features list is incomplete** — the README lists 6 upstream features. Upstream now advertises 16+, including many that are directly relevant to HyperPod users:

  | Upstream feature | In ADT test case? |
  |---|---|
  | FSDP2 with per-param sharding | Yes (default config) |
  | FP8 via torchao | Partially (stale TOML keys) |
  | torch.compile | Mentioned but not demonstrated |
  | Async Tensor Parallelism | Mentioned but `tp_degree = 1` in config |
  | Pipeline Parallelism (zero-bubble) | Mentioned but `pp_degree = 1` in config |
  | Context Parallelism | Mentioned but `cp_degree = 1` in config |
  | MXFP8 (Blackwell GPUs) | No |
  | DDP / HSDP | No |
  | TorchFT (fault-tolerant elastic training) | No |
  | Distributed Checkpointing (async DCP) | No (`enable = false` in config) |
  | Activation Checkpointing | No (not demonstrated) |
  | Gradient Accumulation | No |
  | WandB logging | No |
  | Debugging / profiling tools | No |
  | Distributed inference | No |
- **No Dockerfile / container option** — the test case uses a conda environment only. Other ADT test cases (FSDP, DeepSpeed, picotron, TRL, verl) provide Dockerfiles for reproducible container-based execution with Pyxis/Enroot.
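A sketch of what pinning in `0.create_conda_env.sh` could look like, using the release and wheel versions cited under "Upstream reference" below. The nightly index URL is an assumption; the authoritative version strings live in the v0.2.1 release notes:

```bash
# Pin torchtitan to a release tag instead of cloning HEAD.
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
git checkout v0.2.1

# Install the torch + torchao builds the release was tested against
# (versions below are the ones named in this issue's upstream reference;
# confirm against the release notes before copying).
pip3 install --pre "torch==2.11.0.dev20251226+cu126" \
    --index-url https://download.pytorch.org/whl/nightly/cu126
pip3 install --pre "torchao==0.16.0.dev20251226+cu126" \
    --index-url https://download.pytorch.org/whl/nightly/cu126
```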
Category
Enhancement to existing test case
Alternatives Considered
No alternatives — this is about updating the existing test case to reflect upstream's current state.
Additional Context
Affected files
| File | Issues |
|---|---|
| `3.test_cases/pytorch/torchtitan/README.md` | Stale key-features list; stale float8 TOML example |
| `3.test_cases/pytorch/torchtitan/slurm/README.md` | References old `models/llama/` path; stale float8 config instructions |
| `3.test_cases/pytorch/torchtitan/slurm/0.create_conda_env.sh` | Unpinned `git clone`; unpinned nightly torch/torchao; uses `cu124` but should match release |
| `3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh` | Broken config path (`models/llama/` → `models/llama3/`); `LD_PRELOAD` CUDA 12.1 mismatch |
Suggested fixes (prioritized)
| Priority | Fix | Effort |
|---|---|---|
| P0 | Fix broken config path: `models/llama/` → `models/llama3/` | Trivial |
| P0 | Pin torchtitan to a release tag (e.g., `git checkout v0.2.1`) and install matching torch + torchao versions from the release notes | Small |
| P0 | Fix CUDA version in `LD_PRELOAD` to match the installed CUDA toolkit | Trivial |
| P1 | Update float8 config example to use `[quantize.linear.float8]` | Trivial |
| P1 | Add configs demonstrating TP, PP, and CP (e.g., a llama3_70b config with `pp_degree=4`, `tp_degree=2`; see the sketch after this table) | Medium |
| P1 | Add activation checkpointing and async DCP examples | Small |
| P2 | Add a Dockerfile for container-based execution (Pyxis/Enroot compatible) | Medium |
| P2 | Add configs for additional model families (Llama 4 MoE, DeepSeek-V3, Qwen 3) | Medium |
| P2 | Document torch.compile and FP8 as first-class config options rather than afterthought "optimization tips" | Small |
| P3 | Add WandB logging configuration example | Small |
| P3 | Update key-features list in the top-level README to match upstream | Trivial |
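To make the P1/P3 config items concrete, a single hypothetical TOML fragment covering parallelism degrees, activation checkpointing, async DCP, and WandB logging. Section names follow the config-format list under "Upstream reference" below, except `[metrics]`, which is an assumption; the individual keys (`tp_degree`, `pp_degree`, `cp_degree`, `mode`, `async_mode`, `enable_wandb`) follow this issue's shorthand and should be verified against upstream's sample configs:

```toml
# Hypothetical fragment for a llama3_70b multi-dimensional parallelism config.
[parallelism]
tp_degree = 2    # tensor parallelism (async TP is configured here too)
pp_degree = 4    # pipeline parallelism (zero-bubble schedule)
cp_degree = 1    # context parallelism; raise for long-sequence runs

[activation_checkpoint]
mode = "selective"    # trade recompute for activation memory

[checkpoint]
enable = true         # currently `enable = false` in the test case's config
async_mode = "async"  # async distributed checkpointing (DCP)

[metrics]
enable_wandb = true   # WandB logging (the P3 item above)
```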
Upstream reference
- Repo: pytorch/torchtitan — 5,076 stars, actively developed (daily commits)
- Latest release: v0.2.1 (2025-12-26), requires `torch-2.11.0.dev20251226+cu126` + `torchao-0.16.0.dev20251226+cu126`
- Models directory: `torchtitan/models/{llama3, llama4, deepseek_v3, qwen3, flux, gpt_oss}`
- Config format: TOML with sections `[job]`, `[model]`, `[training]`, `[parallelism]`, `[compile]`, `[activation_checkpoint]`, `[quantize.linear.float8]`, `[checkpoint]`, `[validation]`