Description
Feature Description & Motivation
The picotron test case (3.test_cases/pytorch/picotron/) has several issues that make builds non-reproducible, documentation inaccurate, and multi-node runs unreliable on EFA-equipped instances. Other test cases in the repo (e.g., FSDP, DeepSpeed, TRL) have matured past these problems and can serve as reference implementations.
Key problems:

- **Non-reproducible Docker builds** — the Dockerfile clones `huggingface/picotron` at HEAD (`git clone https://github.com/huggingface/picotron`) with no pinned tag, branch, or commit hash. Upstream picotron has no releases or tags, so any upstream change silently alters the built image. Upstream is also pinned to `torch==2.1.0` (Oct 2023), which is considerably older than what other test cases use.
- **Documentation path mismatch** — the Slurm README (`SmolLM-1.7B/slurm/README.md`) says the config will be created at `./conf/llama-1B-dp2-tp2-pp2/config.json`, but the `create_config.py` command shown uses `--exp_name llama-1B-tp2`, which produces `./conf/llama-1B-tp2/config.json`. The sbatch script correctly references `llama-1B-tp2`, so the README text is wrong.
- **Duplicate prerequisite in EC2 README** — `SmolLM-1.7B/ec2/README.md` lists "Build the `picotron` container image" as both prerequisite #1 and prerequisite #3.
- **No EFA/NCCL tuning for multi-node** — `train.sbatch` has no EFA environment variables (`FI_PROVIDER`, `FI_EFA_USE_HUGE_PAGE`, etc.) and no NCCL tuning (`NCCL_DEBUG`, `NCCL_BUFFSIZE`, `NCCL_TUNER_PLUGIN`). Other test cases (FSDP, TRL) set these. Without them, multi-node runs on EFA-equipped instances (p4d, p5) may fall back to TCP or hit silent performance issues.
- **`create_config.py` depends on upstream's `template/base_config.json`** — this file is not in the ADT repo; it comes from the `git clone` inside the container. This is technically correct (the Dockerfile sets `WORKDIR /picotron`) but is confusing, and means `create_config.py` cannot be used outside the container without also cloning picotron.
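The Dockerfile pinning fix could be as small as the following sketch. The commit hash is the upstream HEAD noted in this issue; the clone path and `WORKDIR` mirror the existing Dockerfile's `/picotron` layout, and both should be validated against the actual file before adopting:

```dockerfile
# Sketch only: pin the upstream clone to a known commit so rebuilds are reproducible.
# 59714b1bf255 is the latest upstream commit noted in this issue; substitute whatever
# commit has actually been validated.
RUN git clone https://github.com/huggingface/picotron /picotron \
    && cd /picotron \
    && git checkout 59714b1bf255
WORKDIR /picotron
```

Because upstream publishes no tags, a full commit hash (rather than a branch name) is the only reference that cannot move.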
Category
Enhancement to existing test case
Alternatives Considered
No alternatives — these are improvements to the existing test case.
Additional Context
Affected files
| File | Issue |
|---|---|
| `3.test_cases/pytorch/picotron/picotron.Dockerfile` | Unpinned `git clone` of picotron; `torch==2.1.0` |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/README.md` | Config path says `llama-1B-dp2-tp2-pp2` but should be `llama-1B-tp2` |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/ec2/README.md` | Duplicate prerequisite (#1 and #3 both say "build the container") |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/train.sbatch` | Missing EFA/NCCL environment variables for multi-node |
| `3.test_cases/pytorch/picotron/create_config.py` | Implicit dependency on upstream `template/base_config.json` |
Upstream reference
- Upstream repo: huggingface/picotron — no releases/tags; latest commit `59714b1bf255` (2025-08-26)
- Upstream pins `torch==2.1.0`, `triton==2.1.0`, `flash-attn==2.5.0`, `transformers==4.47.0`
- Upstream has no Dockerfile — the one in ADT is the only containerized build
Suggested fixes
| # | Fix | Effort |
|---|---|---|
| 1 | Pin picotron to a specific commit hash in the Dockerfile (e.g., `git clone ... && cd picotron && git checkout 59714b1bf255`) | Small |
| 2 | Fix the config path in the Slurm README | Trivial |
| 3 | Remove the duplicate prerequisite in the EC2 README | Trivial |
| 4 | Add standard EFA/NCCL environment variables to `train.sbatch` (see FSDP's sbatch files for reference) | Small |
| 5 | Either vendor `template/base_config.json` into the ADT repo or document the container-only constraint in the README | Small |
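For fix 4, a minimal sketch of the exports that could go near the top of `train.sbatch`. The variable names come from the EFA/NCCL settings listed in this issue; the specific values are assumptions to validate against the FSDP test case's sbatch files and the target instance type:

```shell
# Sketch: EFA/NCCL settings of the kind the repo's other sbatch files export.
# Values are assumptions; validate against the FSDP test case before adopting.
export FI_PROVIDER=efa            # route NCCL traffic over EFA instead of TCP
export FI_EFA_USE_HUGE_PAGE=0     # disable huge pages to avoid allocation issues
export NCCL_DEBUG=WARN            # surface NCCL warnings without flooding logs
```

Setting `NCCL_DEBUG=INFO` temporarily on a first multi-node run is a cheap way to confirm in the logs that the EFA provider was actually selected rather than a TCP fallback.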
Patterns from other test cases
For reference, the FSDP and DeepSpeed test cases:
- Pin all upstream repos to specific versions/tags/commits
- Set EFA environment variables (`FI_PROVIDER=efa`, `FI_EFA_USE_HUGE_PAGE=0`, `NCCL_DEBUG=WARN`)
- Separate `--output` and `--error` in sbatch
- Provide per-model config generation with Jinja2 templates