
[Feature]: Update torchtitan test case — broken paths, stale configs, missing models and parallelism features #976

@KeitaW

Description

Feature Description & Motivation

The torchtitan test case (3.test_cases/pytorch/torchtitan/) was written around March 2025 against an early version of pytorch/torchtitan. Since then, torchtitan has had three releases (v0.1.0, v0.2.0, v0.2.1), restructured its directory layout, expanded from 1 model family to 6, and added numerous distributed training features. The test case is now broken out of the box because the config path it references no longer exists upstream.

What's broken

  1. Config path no longer exists — the sbatch script references torchtitan/models/llama/train_configs/llama3_8b.toml, but upstream renamed models/llama/ to models/llama3/ (the old path returns a 404). The test case will fail immediately on a fresh clone.

  2. Float8 TOML keys are stale — the README shows a [float8] section with enable_float8_linear = true, but upstream restructured this to [quantize.linear.float8] and removed the enable_float8_linear key. Following the README instructions will produce a config error.

  3. CUDA version mismatch — LD_PRELOAD points to /usr/local/cuda-12.1/lib/libnccl.so, but pip installs wheels built for cu124 (CUDA 12.4), and the latest torchtitan release (v0.2.1) targets cu126 (CUDA 12.6). All three fixes are sketched after this list.
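
For concreteness, a sketch of all three fixes. The corrected path, the cu126 target, and the section rename come from this issue; the surrounding script lines and any remaining float8 keys are assumptions to verify against the v0.2.1 release.

```bash
# 1.llama_3_8b_torchtitan.sh: sketch of the path and LD_PRELOAD fixes.

# Old config path, now a 404 upstream:
#   torchtitan/models/llama/train_configs/llama3_8b.toml
CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml"

# Preload the NCCL that matches the CUDA toolkit the wheels target
# (cu126 for v0.2.1), not a hard-coded 12.1 (path layout assumed):
export LD_PRELOAD="/usr/local/cuda-12.6/lib/libnccl.so"
```

And the float8 section rename for the README example:

```toml
# Stale form shown in the README, rejected by current torchtitan:
# [float8]
# enable_float8_linear = true

# Current section name; enable_float8_linear no longer exists, so copy the
# remaining keys from an upstream example config:
[quantize.linear.float8]
```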

What's outdated

  1. Unpinned versions everywhere — git clone fetches torchtitan at HEAD with no tag or commit, pip3 install --pre torch pulls whatever nightly is current, and pip install --pre torchao does the same. torchtitan now has releases (v0.1.0, v0.2.0, v0.2.1) that pin compatible torch + torchao versions, and those should be used instead (see the pinned-install sketch after this list).

  2. Only Llama 3.1 8B — upstream now provides training configs for 6 model families:

    | Model | Configs available | Notes |
    | --- | --- | --- |
    | Llama 3.1 | 8B, 70B, 405B | Already partially covered |
    | Llama 4 (MoE) | 17Bx16E, 17Bx128E | Mixture-of-Experts with expert parallelism |
    | DeepSeek-V3 | 16B, 671B | MoE architecture |
    | Qwen 3 | 0.6B, 1.7B, 32B, MoE | Dense and MoE variants |
    | Flux | dev, schnell | Image generation (diffusion) |
    | GPT-OSS | debug | HF checkpoint loading |
  3. Key features list is incomplete — the README lists 6 upstream features. Upstream now advertises 16+, including many that are directly relevant to HyperPod users:

    | Upstream feature | In ADT test case? |
    | --- | --- |
    | FSDP2 with per-param sharding | Yes (default config) |
    | FP8 via torchao | Partially (stale TOML keys) |
    | torch.compile | Mentioned but not demonstrated |
    | Async Tensor Parallelism | Mentioned but tp_degree=1 in config |
    | Pipeline Parallelism (zero-bubble) | Mentioned but pp_degree=1 in config |
    | Context Parallelism | Mentioned but cp_degree=1 in config |
    | MXFP8 (Blackwell GPUs) | No |
    | DDP / HSDP | No |
    | TorchFT (fault-tolerant elastic training) | No |
    | Distributed Checkpointing (async DCP) | No (enable = false in config) |
    | Activation Checkpointing | No (not demonstrated) |
    | Gradient Accumulation | No |
    | WandB logging | No |
    | Debugging / profiling tools | No |
    | Distributed inference | No |
  4. No Dockerfile / container option — the test case uses a conda environment only. Other ADT test cases (FSDP, DeepSpeed, picotron, TRL, verl) provide Dockerfiles for reproducible container-based execution with Pyxis/Enroot.
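
A pinned-install sketch for 0.create_conda_env.sh, using the versions quoted under "Upstream reference" below; the nightly index URLs and the requirements step are assumptions to verify against upstream's install instructions:

```bash
# Clone at a release tag instead of HEAD:
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
git checkout v0.2.1

# Install the torch/torchao builds the release was tested against,
# rather than whatever nightly is current:
pip3 install --pre torch==2.11.0.dev20251226+cu126 \
  --index-url https://download.pytorch.org/whl/nightly/cu126
pip3 install --pre torchao==0.16.0.dev20251226+cu126 \
  --index-url https://download.pytorch.org/whl/nightly/cu126
pip3 install -r requirements.txt
```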

Category

Enhancement to existing test case

Alternatives Considered

No alternatives — this is about updating the existing test case to reflect upstream's current state.

Additional Context

Affected files

| File | Issues |
| --- | --- |
| 3.test_cases/pytorch/torchtitan/README.md | Stale key-features list; stale float8 TOML example |
| 3.test_cases/pytorch/torchtitan/slurm/README.md | References old models/llama/ path; stale float8 config instructions |
| 3.test_cases/pytorch/torchtitan/slurm/0.create_conda_env.sh | Unpinned git clone; unpinned nightly torch/torchao; uses cu124 but should match the release |
| 3.test_cases/pytorch/torchtitan/slurm/1.llama_3_8b_torchtitan.sh | Broken config path (models/llama/ → models/llama3/); LD_PRELOAD CUDA 12.1 mismatch |

Suggested fixes (prioritized)

| Priority | Fix | Effort |
| --- | --- | --- |
| P0 | Fix broken config path: models/llama/ → models/llama3/ | Trivial |
| P0 | Pin torchtitan to a release tag (e.g., git checkout v0.2.1) and install the matching torch + torchao versions from the release notes | Small |
| P0 | Fix the CUDA version in LD_PRELOAD to match the installed CUDA toolkit | Trivial |
| P1 | Update the float8 config example to use [quantize.linear.float8] | Trivial |
| P1 | Add configs demonstrating TP, PP, and CP (e.g., a llama3_70b config with pp_degree=4, tp_degree=2; see the TOML sketch below) | Medium |
| P1 | Add activation checkpointing and async DCP examples | Small |
| P2 | Add a Dockerfile for container-based execution (Pyxis/Enroot compatible) | Medium |
| P2 | Add configs for additional model families (Llama 4 MoE, DeepSeek-V3, Qwen 3) | Medium |
| P2 | Document torch.compile and FP8 as first-class config options rather than afterthought "optimization tips" | Small |
| P3 | Add a WandB logging configuration example | Small |
| P3 | Update the key-features list in the top-level README to match upstream | Trivial |
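
A hedged sketch of the P1/P3 config additions: the section names follow the upstream TOML layout quoted below, but the individual keys and values are assumptions to check against the v0.2.1 example configs.

```toml
# Hypothetical llama3_70b fragment demonstrating TP + PP, activation
# checkpointing, async DCP, and WandB logging (all keys assumed):

[parallelism]
data_parallel_shard_degree = -1    # FSDP2 over the remaining ranks
tensor_parallel_degree = 2
pipeline_parallel_degree = 4
context_parallel_degree = 1

[activation_checkpoint]
mode = "selective"                 # assumed values: "none" | "selective" | "full"

[checkpoint]
enable = true                      # currently false in the test case
async_mode = "async"               # async DCP; key name is an assumption

[metrics]
enable_wandb = true                # [metrics] section/key assumed for WandB
```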

Upstream reference

  • Repo: pytorch/torchtitan — 5,076 stars, actively developed (daily commits)
  • Latest release: v0.2.1 (2025-12-26), requires torch-2.11.0.dev20251226+cu126 + torchao-0.16.0.dev20251226+cu126
  • Models directory: torchtitan/models/{llama3, llama4, deepseek_v3, qwen3, flux, gpt_oss}
  • Config format: TOML with sections [job], [model], [training], [parallelism], [compile], [activation_checkpoint], [quantize.linear.float8], [checkpoint], [validation]
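
As a quick template, that layout as a section skeleton (empty tables are valid TOML; per-section keys vary by release, so copy them from a shipped train_configs/*.toml rather than from here):

```toml
[job]
[model]
[training]
[parallelism]
[compile]
[activation_checkpoint]
[quantize.linear.float8]
[checkpoint]
[validation]
```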
