Skip to content

[Feature]: Harden picotron test case — pin version, fix docs, add EFA/NCCL config #975

@KeitaW

Description

@KeitaW

Feature Description & Motivation

The picotron test case (3.test_cases/pytorch/picotron/) has several issues that make builds non-reproducible, documentation inaccurate, and multi-node runs unreliable on EFA-equipped instances. Other test cases in the repo (e.g., FSDP, DeepSpeed, TRL) have matured past these problems and can serve as reference implementations.

Key problems:

  1. Non-reproducible Docker builds — the Dockerfile clones huggingface/picotron at HEAD (git clone https://github.com/huggingface/picotron) with no pinned tag, branch, or commit hash. Upstream picotron has no releases or tags, so any upstream change silently alters the built image. Upstream is also pinned to torch==2.1.0 (Oct 2023), which is considerably older than what other test cases use.

  2. Documentation path mismatch — the Slurm README (SmolLM-1.7B/slurm/README.md) says the config will be created at ./conf/llama-1B-dp2-tp2-pp2/config.json, but the create_config.py command shown uses --exp_name llama-1B-tp2, which produces ./conf/llama-1B-tp2/config.json. The sbatch script correctly references llama-1B-tp2, so the README text is wrong.

  3. Duplicate prerequisite in EC2 READMESmolLM-1.7B/ec2/README.md lists "Build the picotron container image" as both prerequisite Bump numpy from 1.21.5 to 1.22.0 in /3.test_cases/4.DDP #1 and prerequisite Update README.md #3.

  4. No EFA/NCCL tuning for multi-nodetrain.sbatch has no EFA environment variables (FI_PROVIDER, FI_EFA_USE_HUGE_PAGE, etc.) and no NCCL tuning (NCCL_DEBUG, NCCL_BUFFSIZE, NCCL_TUNER_PLUGIN). Other test cases (FSDP, TRL) set these. Without them, multi-node runs on EFA-equipped instances (p4d, p5) may fall back to TCP or hit silent performance issues.

  5. create_config.py depends on upstream's template/base_config.json — this file is not in the ADT repo; it comes from the git clone inside the container. This is technically correct (the Dockerfile sets WORKDIR /picotron) but is confusing and means create_config.py cannot be used outside the container without also cloning picotron.

Category

Enhancement to existing test case

Alternatives Considered

No alternatives — these are improvements to the existing test case.

Additional Context

Affected files

File Issue
3.test_cases/pytorch/picotron/picotron.Dockerfile Unpinned git clone of picotron; torch==2.1.0
3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/README.md Config path says llama-1B-dp2-tp2-pp2 but should be llama-1B-tp2
3.test_cases/pytorch/picotron/SmolLM-1.7B/ec2/README.md Duplicate prerequisite (#1 and #3 both say "build the container")
3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/train.sbatch Missing EFA/NCCL environment variables for multi-node
3.test_cases/pytorch/picotron/create_config.py Implicit dependency on upstream template/base_config.json

Upstream reference

  • Upstream repo: huggingface/picotron — no releases/tags; latest commit 59714b1bf255 (2025-08-26)
  • Upstream pins torch==2.1.0, triton==2.1.0, flash-attn==2.5.0, transformers==4.47.0
  • Upstream has no Dockerfile — the one in ADT is the only containerized build

Suggested fixes

# Fix Effort
1 Pin picotron to a specific commit hash in the Dockerfile (e.g., git clone ... && cd picotron && git checkout 59714b1bf255) Small
2 Fix the config path in the Slurm README Trivial
3 Remove the duplicate prerequisite in the EC2 README Trivial
4 Add standard EFA/NCCL environment variables to train.sbatch (see FSDP's sbatch files for reference) Small
5 Either vendor template/base_config.json into the ADT repo or document the container-only constraint in the README Small

Patterns from other test cases

For reference, the FSDP and DeepSpeed test cases:

  • Pin all upstream repos to specific versions/tags/commits
  • Set EFA environment variables (FI_PROVIDER=efa, FI_EFA_USE_HUGE_PAGE=0, NCCL_DEBUG=WARN)
  • Separate --output and --error in sbatch
  • Provide per-model config generation with Jinja2 templates

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

Status

Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions