
P9 Training Stack Optimization #531

Draft
mohantee wants to merge 171 commits into staging from p9/feat/reversibility_test

Conversation


@mohantee mohantee commented Feb 21, 2026

P9 Training Stack Optimization

Description

Training infrastructure for scaling from 1B dense to 70B MoE models.
Covers compute estimation, data loading, custom GPU kernels, model growth, and checkpoint management.

1. FLOPs & Cost Governor (FLOPS-Calculation/)
Attention-aware compute/memory estimator supporting GQA, GSA, DeltaNet, MLA, and hybrid attention with full MoE accounting
Reversible training analysis — demonstrates 64× batch size increase on 70B MoE (H200 GPUs)
Streamlit dashboard + config presets for 1B variants and multi-stage 1B→70B MoE plans
MoE upcycling cost comparison (slicing vs random projection vs SVD)

2. High-Performance Data Pipeline (data_loader/ + src/data.py)
Three loading modes: offline (load_from_disk), online (download pretokenized), and HuggingFace streaming
SPDL integration with memory-mapped I/O, S3→NVMe staging, and async GPU prefetching
Block packing with multi-size support and domain-aware chunking (no padding waste)
Shard tracking for deterministic checkpoint resume + distributed sampling for multi-GPU
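The zero-padding claim follows from how block packing works: token streams are concatenated and cut into fixed-size blocks, so no position is wasted on pad tokens. A minimal sketch of that idea (the real loader additionally handles multiple block sizes, domain-aware chunking, and shard offsets; `pack_blocks` is an illustrative name):

```python
def pack_blocks(docs, block_size):
    """Greedy block packing: concatenate tokenized docs and emit
    fixed-size blocks with zero padding. Leftover tokens carry over."""
    buf, blocks = [], []
    for doc in docs:
        buf.extend(doc)
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
    return blocks, buf

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks, leftover = pack_blocks(docs, 4)
# blocks == [[1, 2, 3, 4], [5, 6, 7, 8]], leftover == [9]
```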

3. Triton Kernel Library (src/kernels/)
Sparse attention (O(T×k)) and gated indexer (6 GB → 134 MB memory at T=4096) with streaming chunked top-k for 256K+ context
Fused RMSNorm (forward + backward) and fused Sinkhorn-Knopp (40 launches → 1)
DeltaNet fla wrapper with fp32 upcast and parameter mapping
All kernels are reversibility-safe with automatic PyTorch fallbacks
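The streaming chunked top-k that makes 256K+ context feasible keeps only O(k) state while scanning score chunks, instead of materializing the full T-length score vector. A CPU-side sketch of the algorithm (the kernel version does this per query block on GPU; names are illustrative):

```python
import heapq

def streaming_topk(score_chunks, k):
    """Top-k indices over a stream of score chunks using O(k) memory.

    Maintains a size-k min-heap of (score, index); each new score
    replaces the heap minimum only if it is larger.
    """
    heap, offset = [], 0
    for chunk in score_chunks:
        for i, s in enumerate(chunk):
            idx = offset + i
            if len(heap) < k:
                heapq.heappush(heap, (s, idx))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, idx))
        offset += len(chunk)
    # Indices of the k largest scores, highest score first
    return [idx for s, idx in sorted(heap, reverse=True)]

chunks = [[0.1, 0.9, 0.3], [0.7, 0.2], [0.8, 0.4]]
top = streaming_topk(chunks, 3)  # → [1, 5, 3]
```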

4. Dense → MoE Growth (src/growth/growth_utils.py)
SwiGLU-aware SVD compression (joint/independent) preserving gate/up coordinate alignment
Orthogonal rotation for expert diversity + null-biased router initialization
Validation: cosine similarity checks and loss-equivalence under null routing
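The validation gate rests on a simple invariant: with a null-biased router sending everything to the copied expert, the grown MoE should reproduce the dense model's output almost exactly. A minimal sketch of such a check (function names and the 0.999 threshold are illustrative, not the project's actual values):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation/weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def check_growth(dense_out, moe_out, threshold=0.999):
    """Pass iff the null-routed MoE output matches the dense output.

    The real validation also compares training loss under null routing.
    """
    sim = cosine_similarity(dense_out, moe_out)
    return sim >= threshold, sim

ok, sim = check_growth([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```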

5. S3 Checkpoint Manager (src/checkpoint.py)
Non-blocking background S3 upload with retry + exponential backoff
Multi-node aware with per-node upload threads and automatic file distribution
Local checkpoint cleanup with configurable retention
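The non-blocking upload pattern can be sketched as a daemon thread wrapping a retried call. This is an assumption-laden illustration: `upload_fn` stands in for the real S3 client call (e.g. boto3's `upload_file`), and the retry counts and delays are placeholders, not the project's configuration.

```python
import threading
import time

def upload_with_retry(upload_fn, path, max_retries=5, base_delay=1.0):
    """Call upload_fn(path), retrying with exponential backoff
    (base_delay, 2x per attempt). Returns True on success."""
    for attempt in range(max_retries):
        try:
            upload_fn(path)
            return True
        except Exception:
            if attempt == max_retries - 1:
                return False
            time.sleep(base_delay * (2 ** attempt))
    return False

def background_upload(upload_fn, path):
    """Fire-and-forget upload so the training loop never blocks on S3."""
    t = threading.Thread(
        target=upload_with_retry, args=(upload_fn, path), daemon=True
    )
    t.start()
    return t
```

A multi-node version would start one such thread per node and partition checkpoint files across them.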

Checklist

  • I have added tests that prove my fix is effective or that my feature works.
  • I have added necessary documentation (if applicable).
  • My code follows the style guidelines, gitflow branching strategy, and naming conventions of this project [Contribution Guidelines](https://github.com/The-School-of-AI/LLM/tree/main/experiments/

Reviewers

  • Reviewer 1: A member from your own team.
  • Reviewer 2: A member from the repo owners team (@The-School-of-AI/llm-repo-owners).

Note: Every pull request requires at least 2 reviewers/approvers before it can be merged.

Jayant-Guru-Shrivastava and others added 30 commits January 31, 2026 22:09
- Add zero-2-moe.json and zero-3-moe.json configs with fp16_master_weights_and_grads
- Add src/moe_utils.py with is_moe_model() and create_moe_param_groups()
- Update main.py to auto-detect MoE models and use proper param groups
- Add test/test_moe.py with 12 tests for MoE functionality
- Document MoE configuration in README.md
- Note: ZeRO-3 + MoE race condition fixed in DeepSpeed 0.18.6 (not yet on PyPI)

Related to #207
Updated README to reflect changes in the DeepSpeed training template, including new features, project structure, and installation instructions.
DeepSpeed's native MoE layer (deepspeed.moe.layer.MoE) does NOT support ZeRO-3.
This is explicitly blocked in DeepSpeed's runtime/engine.py:
  'assert not self.has_moe_layers, MoE not supported with Stage 3'

Official docs confirm ZeRO-2 only:
https://www.deepspeed.ai/tutorials/mixture-of-experts/

Removed zero-3-moe.json and updated README to reflect this limitation.

Related to #207
* Adding MoE-supported model and modifying print statements to print only once when running in a distributed env

* Updating deepspeed config and adding script to sanity check memory before running the model

* Updating added print statements to avoid duplicate printing

* Adding checkpointing mechanism with non-GPU-blocking characteristics

* Adding checkpointing feature

* Fixing linting issues

* Fixing linting issues

* Fixing linting issues

* Fixing linting issues

* fixing linting issues

* Adding simple step to verify EP

* Updating code with MoE Qwen

* Removing misleading comments

---------

Co-authored-by: yashwant ram m <yash@yashwants-MacBook-Pro.local>
@mohantee mohantee changed the title P9 team changes to Staging P9 Training Stack Optimization Feb 21, 2026
@N-Mahesh N-Mahesh self-requested a review February 21, 2026 17:10
@N-Mahesh N-Mahesh self-assigned this Feb 21, 2026
@abi2024 abi2024 changed the base branch from staging to p11/feat/5phase-growth-pipeline February 21, 2026 21:50
@N-Mahesh N-Mahesh changed the base branch from p11/feat/5phase-growth-pipeline to staging February 22, 2026 16:27
@N-Mahesh N-Mahesh removed their assignment Feb 23, 2026
@nishanthvonteddu nishanthvonteddu requested review from nishanthvonteddu and removed request for nishanthvonteddu February 24, 2026 23:41
@vj1117 vj1117 closed this Feb 25, 2026
@vj1117 vj1117 reopened this Feb 25, 2026
@pankaj1311
Collaborator

Please document in the README.md which charter goals were met, the deferred scope, the reports submitted, and any limitations.

@N-Mahesh N-Mahesh marked this pull request as draft February 28, 2026 07:41