Draft
Conversation
- Add zero-2-moe.json and zero-3-moe.json configs with fp16_master_weights_and_grads
- Add src/moe_utils.py with is_moe_model() and create_moe_param_groups()
- Update main.py to auto-detect MoE models and use proper param groups
- Add test/test_moe.py with 12 tests for MoE functionality
- Document MoE configuration in README.md
- Note: ZeRO-3 + MoE race condition fixed in DeepSpeed 0.18.6 (not yet on PyPI)

Related to #207
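For context, the two helpers named in this commit might look roughly like the sketch below. The function names come from the commit message; everything inside them (the name-based heuristic, the `allreduce` attribute check, the group dict keys) is an assumption, not the actual contents of src/moe_utils.py.

```python
def is_moe_model(model) -> bool:
    """Heuristic MoE detector (assumed implementation): treat a model as
    MoE if any parameter name mentions an expert, or if DeepSpeed has
    tagged a parameter as expert-parallel (allreduce=False)."""
    for name, param in model.named_parameters():
        if "expert" in name.lower():
            return True
        if getattr(param, "allreduce", True) is False:
            return True
    return False


def create_moe_param_groups(model):
    """Split parameters into a dense group and an expert group so the
    optimizer (and ZeRO) can handle expert parameters separately.
    The exact group keys are illustrative."""
    dense, expert = [], []
    for name, param in model.named_parameters():
        if not getattr(param, "requires_grad", True):
            continue
        is_expert = ("expert" in name.lower()
                     or getattr(param, "allreduce", True) is False)
        (expert if is_expert else dense).append(param)
    groups = [{"params": dense, "name": "dense"}]
    if expert:
        groups.append({"params": expert, "moe": True, "name": "expert"})
    return groups
```

The sketch is duck-typed on `named_parameters()` so it applies to any torch-like module without importing torch here.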
Updated README to reflect changes in the DeepSpeed training template, including new features, project structure, and installation instructions.
DeepSpeed's native MoE layer (deepspeed.moe.layer.MoE) does NOT support ZeRO-3. This is explicitly blocked in DeepSpeed's runtime/engine.py:

    assert not self.has_moe_layers, "MoE not supported with Stage 3"

Official docs confirm ZeRO-2 only: https://www.deepspeed.ai/tutorials/mixture-of-experts/

Removed zero-3-moe.json and updated README to reflect this limitation.

Related to #207
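Given this limitation, a minimal sketch of what the surviving zero-2-moe.json might contain is shown below. Only `fp16_master_weights_and_grads` and `"stage": 2` are grounded in the commit messages; every other key and value is illustrative, and DeepSpeed may impose additional constraints on this flag (check the DeepSpeed config documentation for your version).

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "fp16": {
    "enabled": true,
    "fp16_master_weights_and_grads": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```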
…AI/LLM into p9/feat/dashboard
* Adding MoE-supported model and modifying print statement to print only once when running in a distributed env
* Updating DeepSpeed config and adding a script to sanity-check memory before running the model
* Updating print statements to avoid duplicated output
* Adding checkpointing mechanism with non-GPU-blocking characteristics
* Adding checkpointing feature
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Adding simple step to verify EP
* Updating code with MoE Qwen
* Removing misleading comments

---------

Co-authored-by: yashwant ram m <yash@yashwants-MacBook-Pro.local>
…ure-aware modes (Ref #156)
N-Mahesh
approved these changes
Feb 21, 2026
nishanthvonteddu
approved these changes
Feb 24, 2026
…art instructions for P9 training stack
…chool-of-AI/LLM into p9/feat/reversibility_test
Collaborator

Please include in the README.md what was met for the charter, deferred scope, reports submitted, and any limitations.
P9 Training Stack Optimization
Description
Training infrastructure for scaling from 1B dense to 70B MoE models.
Covers compute estimation, data loading, custom GPU kernels, model growth, and checkpoint management.
1. FLOPs & Cost Governor (FLOPS-Calculation/)
Attention-aware compute/memory estimator supporting GQA, GSA, DeltaNet, MLA, and hybrid attention with full MoE accounting
Reversible training analysis — demonstrates 64× batch size increase on 70B MoE (H200 GPUs)
Streamlit dashboard + config presets for 1B variants and multi-stage 1B→70B MoE plans
MoE upcycling cost comparison (slicing vs random projection vs SVD)
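As a rough illustration of the MoE accounting an estimator like this performs, here is a hedged sketch using the common 6·N·D training-FLOPs approximation. The function names and the exact formulas are assumptions for illustration, not the estimator's actual code; for a top-k router, only k of n experts run per token, so expert parameters contribute at a k/n discount.

```python
def moe_active_params(dense_params: float, expert_params: float,
                      n_experts: int, top_k: int) -> float:
    """Active parameters per token for a top-k MoE: all dense params,
    plus the fraction k/n of expert params that actually execute."""
    return dense_params + expert_params * (top_k / n_experts)


def training_flops(n_active_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: forward + backward is roughly
    6 FLOPs per active parameter per token (6 * N * D)."""
    return 6.0 * n_active_params * n_tokens
```

For example, a model with 1B dense parameters and 64B expert parameters spread over 64 experts with top-2 routing activates only 3B parameters per token, so its training FLOPs scale like a 3B dense model, not a 65B one.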
2. High-Performance Data Pipeline (data_loader/ + src/data.py)
Three loading modes: offline (load_from_disk), online (download pretokenized), and HuggingFace streaming
SPDL integration with memory-mapped I/O, S3→NVMe staging, and async GPU prefetching
Block packing with multi-size support and domain-aware chunking (no padding waste)
Shard tracking for deterministic checkpoint resume + distributed sampling for multi-GPU
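The "no padding waste" claim above rests on block packing: instead of padding each sequence to a fixed length, tokenized sequences are concatenated and sliced into full blocks. A minimal single-size sketch (the real pipeline's multi-size and domain-aware variants are not shown, and this helper is illustrative, not the repo's code):

```python
def pack_blocks(sequences, block_size):
    """Concatenate tokenized sequences and slice into fixed-size blocks,
    dropping the incomplete tail, so no block contains padding tokens."""
    flat = [tok for seq in sequences for tok in seq]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]
```

Note that naive concatenation lets a block span a document boundary; the domain-aware chunking mentioned above would additionally keep blocks within a single domain.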
3. Triton Kernel Library (src/kernels/)
Sparse attention (O(T×k)) and gated indexer (6GB → 134MB memory at T=4096) with streaming chunked topk for 256K+ context
Fused RMSNorm (forward + backward) and fused Sinkhorn-Knopp (40 launches → 1)
DeltaNet fla wrapper with fp32 upcast and parameter mapping
All kernels are reversibility-safe with automatic PyTorch fallbacks
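The "automatic PyTorch fallbacks" pattern above can be sketched as a dispatch wrapper: try the fused kernel path, fall back to a reference implementation when Triton is unavailable. The sketch below uses a pure-Python RMSNorm reference so it is self-contained; the real kernels operate on GPU tensors, and the dispatch shape here is an assumption about the library's structure.

```python
import math


def rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector:
    y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def rmsnorm(x, weight, eps=1e-6):
    """Dispatch: use the fused Triton kernel when available, otherwise
    fall back to the reference implementation. Here both branches call
    the reference; only the fallback shape is being demonstrated."""
    try:
        import triton  # noqa: F401  -- fast path would go here
        return rmsnorm_ref(x, weight, eps)
    except ImportError:
        return rmsnorm_ref(x, weight, eps)
```

Because both paths compute the same function, the fallback keeps training bitwise-comparable in spirit, which is what makes a kernel "reversibility-safe" to swap out.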
4. Dense → MoE Growth (src/growth/growth_utils.py)
SwiGLU-aware SVD compression (joint/independent) preserving gate/up coordinate alignment
Orthogonal rotation for expert diversity + null-biased router initialization
Validation: cosine similarity checks and loss-equivalence under null routing
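The SVD step above can be illustrated with plain truncated SVD on a single weight matrix. This sketch omits the SwiGLU-specific part (the joint factorization that keeps gate/up projections coordinate-aligned) and is an illustration of rank-r compression only, assuming NumPy:

```python
import numpy as np


def svd_compress(W, rank):
    """Rank-r approximation of a weight matrix via truncated SVD.
    Returns two low-rank factors (A, B) with A @ B ~= W, replacing an
    (m x n) matrix with (m x r) + (r x n) parameters."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r): left vectors scaled by singular values
    B = Vt[:rank, :]             # (r, n): right singular vectors
    return A, B
```

The cosine-similarity validation mentioned above would then compare rows of `A @ B` against rows of `W` to confirm the compressed expert still points in the same direction as the dense source.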
5. S3 Checkpoint Manager (src/checkpoint.py)
Non-blocking background S3 upload with retry + exponential backoff
Multi-node aware with per-node upload threads and automatic file distribution
Local checkpoint cleanup with configurable retention
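The retry-with-backoff and background-upload behavior above can be sketched as below. `upload_fn` is a placeholder standing in for the real S3 client call (e.g. a boto3 upload); the retry counts, delays, and thread structure are illustrative assumptions, not the contents of src/checkpoint.py.

```python
import threading
import time


def upload_with_retry(upload_fn, path, max_retries=5, base_delay=0.5):
    """Call upload_fn(path), retrying transient failures with
    exponential backoff (base_delay * 2**attempt). Re-raises after
    the final attempt fails."""
    for attempt in range(max_retries):
        try:
            return upload_fn(path)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def upload_in_background(upload_fn, path, **kwargs):
    """Fire-and-forget: run the retrying upload on a daemon thread so
    the training loop never blocks on S3."""
    t = threading.Thread(target=upload_with_retry,
                         args=(upload_fn, path), kwargs=kwargs,
                         daemon=True)
    t.start()
    return t
```

In the multi-node setup described above, each node would run its own such thread over the checkpoint shards it owns.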
Checklist
Reviewers