Description
Feature Description & Motivation
The picotron test case (3.test_cases/pytorch/picotron/) has several issues that make builds non-reproducible, documentation inaccurate, and multi-node runs unreliable on EFA-equipped instances. Other test cases in the repo (e.g., FSDP, DeepSpeed, TRL) have matured past these problems and can serve as reference implementations.
Key problems:

- **Non-reproducible Docker builds** — the Dockerfile clones `huggingface/picotron` at HEAD (`git clone https://github.com/huggingface/picotron`) with no pinned tag, branch, or commit hash. Upstream picotron has no releases or tags, so any upstream change silently alters the built image. Upstream is also pinned to `torch==2.1.0` (Oct 2023), which is considerably older than what other test cases use.
- **Documentation path mismatch** — the Slurm README (`SmolLM-1.7B/slurm/README.md`) says the config will be created at `./conf/llama-1B-dp2-tp2-pp2/config.json`, but the `create_config.py` command shown uses `--exp_name llama-1B-tp2`, which produces `./conf/llama-1B-tp2/config.json`. The sbatch script correctly references `llama-1B-tp2`, so the README text is wrong.
- **Duplicate prerequisite in EC2 README** — `SmolLM-1.7B/ec2/README.md` lists "Build the `picotron` container image" as both prerequisite #1 and prerequisite #3.
- **No EFA/NCCL tuning for multi-node** — `train.sbatch` has no EFA environment variables (`FI_PROVIDER`, `FI_EFA_USE_HUGE_PAGE`, etc.) and no NCCL tuning (`NCCL_DEBUG`, `NCCL_BUFFSIZE`, `NCCL_TUNER_PLUGIN`). Other test cases (FSDP, TRL) set these. Without them, multi-node runs on EFA-equipped instances (p4d, p5) may fall back to TCP or hit silent performance issues.
- **`create_config.py` depends on upstream's `template/base_config.json`** — this file is not in the ADT repo; it comes from the `git clone` inside the container. This is technically correct (the Dockerfile sets `WORKDIR /picotron`) but is confusing, and means `create_config.py` cannot be used outside the container without also cloning picotron.
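The Dockerfile pinning fix could be as small as the following sketch. The commit hash is the upstream HEAD noted in this issue; the clone path and `WORKDIR` mirror the existing Dockerfile's `/picotron` layout, and both should be validated against the actual file before adopting:

```dockerfile
# Sketch only: pin the upstream clone to a known commit so rebuilds are reproducible.
# 59714b1bf255 is the latest upstream commit noted in this issue; substitute whatever
# commit has actually been validated.
RUN git clone https://github.com/huggingface/picotron /picotron \
    && cd /picotron \
    && git checkout 59714b1bf255
WORKDIR /picotron
```

Because upstream publishes no tags, a full commit hash (rather than a branch name) is the only reference that cannot move.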
Category
Enhancement to existing test case
Alternatives Considered
No alternatives — these are improvements to the existing test case.
Additional Context
Affected files
| File | Issue |
|---|---|
| `3.test_cases/pytorch/picotron/picotron.Dockerfile` | Unpinned `git clone` of picotron; `torch==2.1.0` |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/README.md` | Config path says `llama-1B-dp2-tp2-pp2` but should be `llama-1B-tp2` |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/ec2/README.md` | Duplicate prerequisite (#1 and #3 both say "build the container") |
| `3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm/train.sbatch` | Missing EFA/NCCL environment variables for multi-node |
| `3.test_cases/pytorch/picotron/create_config.py` | Implicit dependency on upstream `template/base_config.json` |
Upstream reference
- Upstream repo: huggingface/picotron — no releases/tags; latest commit `59714b1bf255` (2025-08-26)
- Upstream pins `torch==2.1.0`, `triton==2.1.0`, `flash-attn==2.5.0`, `transformers==4.47.0`
- Upstream has no Dockerfile — the one in ADT is the only containerized build
Suggested fixes
| # | Fix | Effort |
|---|---|---|
| 1 | Pin picotron to a specific commit hash in the Dockerfile (e.g., `git clone ... && cd picotron && git checkout 59714b1bf255`) | Small |
| 2 | Fix the config path in the Slurm README | Trivial |
| 3 | Remove the duplicate prerequisite in the EC2 README | Trivial |
| 4 | Add standard EFA/NCCL environment variables to `train.sbatch` (see FSDP's sbatch files for reference) | Small |
| 5 | Either vendor `template/base_config.json` into the ADT repo or document the container-only constraint in the README | Small |
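For fix 4, a minimal sketch of the exports that could go near the top of `train.sbatch`. The variable names come from the EFA/NCCL settings listed in this issue; the specific values are assumptions to validate against the FSDP test case's sbatch files and the target instance type:

```shell
# Sketch: EFA/NCCL settings of the kind the repo's other sbatch files export.
# Values are assumptions; validate against the FSDP test case before adopting.
export FI_PROVIDER=efa            # route NCCL traffic over EFA instead of TCP
export FI_EFA_USE_HUGE_PAGE=0     # disable huge pages to avoid allocation issues
export NCCL_DEBUG=WARN            # surface NCCL warnings without flooding logs
```

Setting `NCCL_DEBUG=INFO` temporarily on a first multi-node run is a cheap way to confirm in the logs that the EFA provider was actually selected rather than a TCP fallback.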
Patterns from other test cases
For reference, the FSDP and DeepSpeed test cases:
- Pin all upstream repos to specific versions/tags/commits
- Set EFA environment variables (`FI_PROVIDER=efa`, `FI_EFA_USE_HUGE_PAGE=0`, `NCCL_DEBUG=WARN`)
- Separate `--output` and `--error` in sbatch
- Provide per-model config generation with Jinja2 templates