Draft
Conversation
- Add zero-2-moe.json and zero-3-moe.json configs with fp16_master_weights_and_grads
- Add src/moe_utils.py with is_moe_model() and create_moe_param_groups()
- Update main.py to auto-detect MoE models and use proper param groups
- Add test/test_moe.py with 12 tests for MoE functionality
- Document MoE configuration in README.md
- Note: ZeRO-3 + MoE race condition fixed in DeepSpeed 0.18.6 (not yet on PyPI)

Related to #207
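For context, the two helpers named in this commit might look roughly like the sketch below. The function names come from the commit message; everything inside them (the name-based heuristic, the `allreduce` attribute check, the group dict keys) is an assumption, not the actual contents of src/moe_utils.py.

```python
def is_moe_model(model) -> bool:
    """Heuristic MoE detector (assumed implementation): treat a model as
    MoE if any parameter name mentions an expert, or if DeepSpeed has
    tagged a parameter as expert-parallel (allreduce=False)."""
    for name, param in model.named_parameters():
        if "expert" in name.lower():
            return True
        if getattr(param, "allreduce", True) is False:
            return True
    return False


def create_moe_param_groups(model):
    """Split parameters into a dense group and an expert group so the
    optimizer (and ZeRO) can handle expert parameters separately.
    The exact group keys are illustrative."""
    dense, expert = [], []
    for name, param in model.named_parameters():
        if not getattr(param, "requires_grad", True):
            continue
        is_expert = ("expert" in name.lower()
                     or getattr(param, "allreduce", True) is False)
        (expert if is_expert else dense).append(param)
    groups = [{"params": dense, "name": "dense"}]
    if expert:
        groups.append({"params": expert, "moe": True, "name": "expert"})
    return groups
```

The sketch is duck-typed on `named_parameters()` so it applies to any torch-like module without importing torch here.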
Updated README to reflect changes in the DeepSpeed training template, including new features, project structure, and installation instructions.
DeepSpeed's native MoE layer (deepspeed.moe.layer.MoE) does NOT support ZeRO-3. This is explicitly blocked in DeepSpeed's runtime/engine.py:

    assert not self.has_moe_layers, "MoE not supported with Stage 3"

Official docs confirm ZeRO-2 only: https://www.deepspeed.ai/tutorials/mixture-of-experts/

Removed zero-3-moe.json and updated README to reflect this limitation.

Related to #207
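Given this limitation, a minimal sketch of what the surviving zero-2-moe.json might contain is shown below. Only `fp16_master_weights_and_grads` and `"stage": 2` are grounded in the commit messages; every other key and value is illustrative, and DeepSpeed may impose additional constraints on this flag (check the DeepSpeed config documentation for your version).

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "fp16": {
    "enabled": true,
    "fp16_master_weights_and_grads": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```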
…AI/LLM into p9/feat/dashboard
* Adding MoE-supported model and modifying print statement to print only once when running in a distributed env
* Updating DeepSpeed config and adding a script to sanity-check memory before running the model
* Updating print statements to avoid duplicated output
* Adding checkpointing mechanism with non-GPU-blocking characteristics
* Adding checkpointing feature
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Fixing linting issues
* Adding simple step to verify EP
* Updating code with MoE Qwen
* Removing misleading comments

---------

Co-authored-by: yashwant ram m <yash@yashwants-MacBook-Pro.local>
…ure-aware modes (Ref #156)
N-Mahesh
approved these changes
Feb 21, 2026
nishanthvonteddu
approved these changes
Feb 24, 2026
…art instructions for P9 training stack
…chool-of-AI/LLM into p9/feat/reversibility_test
Collaborator

Please include in the README.md what was met for the charter, deferred scope, reports submitted, and any limitations.
P9 Training Stack Optimization
Description
Training infrastructure for scaling from 1B dense to 70B MoE models.
Covers compute estimation, data loading, custom GPU kernels, model growth, and checkpoint management.
1. FLOPs & Cost Governor (FLOPS-Calculation/)
Attention-aware compute/memory estimator supporting GQA, GSA, DeltaNet, MLA, and hybrid attention with full MoE accounting
Reversible training analysis — demonstrates 64× batch size increase on 70B MoE (H200 GPUs)
Streamlit dashboard + config presets for 1B variants and multi-stage 1B→70B MoE plans
MoE upcycling cost comparison (slicing vs random projection vs SVD)
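As a rough illustration of the MoE accounting an estimator like this performs, here is a hedged sketch using the common 6·N·D training-FLOPs approximation. The function names and the exact formulas are assumptions for illustration, not the estimator's actual code; for a top-k router, only k of n experts run per token, so expert parameters contribute at a k/n discount.

```python
def moe_active_params(dense_params: float, expert_params: float,
                      n_experts: int, top_k: int) -> float:
    """Active parameters per token for a top-k MoE: all dense params,
    plus the fraction k/n of expert params that actually execute."""
    return dense_params + expert_params * (top_k / n_experts)


def training_flops(n_active_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: forward + backward is roughly
    6 FLOPs per active parameter per token (6 * N * D)."""
    return 6.0 * n_active_params * n_tokens
```

For example, a model with 1B dense parameters and 64B expert parameters spread over 64 experts with top-2 routing activates only 3B parameters per token, so its training FLOPs scale like a 3B dense model, not a 65B one.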
2. High-Performance Data Pipeline (data_loader/ + src/data.py)
Three loading modes: offline (load_from_disk), online (download pretokenized), and HuggingFace streaming
SPDL integration with memory-mapped I/O, S3→NVMe staging, and async GPU prefetching
Block packing with multi-size support and domain-aware chunking (no padding waste)
Shard tracking for deterministic checkpoint resume + distributed sampling for multi-GPU
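The "no padding waste" claim above rests on block packing: instead of padding each sequence to a fixed length, tokenized sequences are concatenated and sliced into full blocks. A minimal single-size sketch (the real pipeline's multi-size and domain-aware variants are not shown, and this helper is illustrative, not the repo's code):

```python
def pack_blocks(sequences, block_size):
    """Concatenate tokenized sequences and slice into fixed-size blocks,
    dropping the incomplete tail, so no block contains padding tokens."""
    flat = [tok for seq in sequences for tok in seq]
    n_blocks = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]
```

Note that naive concatenation lets a block span a document boundary; the domain-aware chunking mentioned above would additionally keep blocks within a single domain.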
3. Triton Kernel Library (src/kernels/)
Sparse attention (O(T×k)) and gated indexer (6GB → 134MB memory at T=4096) with streaming chunked topk for 256K+ context
Fused RMSNorm (forward + backward) and fused Sinkhorn-Knopp (40 launches → 1)
DeltaNet fla wrapper with fp32 upcast and parameter mapping
All kernels are reversibility-safe with automatic PyTorch fallbacks
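The "automatic PyTorch fallbacks" pattern above can be sketched as a dispatch wrapper: try the fused kernel path, fall back to a reference implementation when Triton is unavailable. The sketch below uses a pure-Python RMSNorm reference so it is self-contained; the real kernels operate on GPU tensors, and the dispatch shape here is an assumption about the library's structure.

```python
import math


def rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector:
    y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]


def rmsnorm(x, weight, eps=1e-6):
    """Dispatch: use the fused Triton kernel when available, otherwise
    fall back to the reference implementation. Here both branches call
    the reference; only the fallback shape is being demonstrated."""
    try:
        import triton  # noqa: F401  -- fast path would go here
        return rmsnorm_ref(x, weight, eps)
    except ImportError:
        return rmsnorm_ref(x, weight, eps)
```

Because both paths compute the same function, the fallback keeps training bitwise-comparable in spirit, which is what makes a kernel "reversibility-safe" to swap out.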
4. Dense → MoE Growth (src/growth/growth_utils.py)
SwiGLU-aware SVD compression (joint/independent) preserving gate/up coordinate alignment
Orthogonal rotation for expert diversity + null-biased router initialization
Validation: cosine similarity checks and loss-equivalence under null routing
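The SVD step above can be illustrated with plain truncated SVD on a single weight matrix. This sketch omits the SwiGLU-specific part (the joint factorization that keeps gate/up projections coordinate-aligned) and is an illustration of rank-r compression only, assuming NumPy:

```python
import numpy as np


def svd_compress(W, rank):
    """Rank-r approximation of a weight matrix via truncated SVD.
    Returns two low-rank factors (A, B) with A @ B ~= W, replacing an
    (m x n) matrix with (m x r) + (r x n) parameters."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r): left vectors scaled by singular values
    B = Vt[:rank, :]             # (r, n): right singular vectors
    return A, B
```

The cosine-similarity validation mentioned above would then compare rows of `A @ B` against rows of `W` to confirm the compressed expert still points in the same direction as the dense source.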
5. S3 Checkpoint Manager (src/checkpoint.py)
Non-blocking background S3 upload with retry + exponential backoff
Multi-node aware with per-node upload threads and automatic file distribution
Local checkpoint cleanup with configurable retention
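The retry-with-backoff and background-upload behavior above can be sketched as below. `upload_fn` is a placeholder standing in for the real S3 client call (e.g. a boto3 upload); the retry counts, delays, and thread structure are illustrative assumptions, not the contents of src/checkpoint.py.

```python
import threading
import time


def upload_with_retry(upload_fn, path, max_retries=5, base_delay=0.5):
    """Call upload_fn(path), retrying transient failures with
    exponential backoff (base_delay * 2**attempt). Re-raises after
    the final attempt fails."""
    for attempt in range(max_retries):
        try:
            return upload_fn(path)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def upload_in_background(upload_fn, path, **kwargs):
    """Fire-and-forget: run the retrying upload on a daemon thread so
    the training loop never blocks on S3."""
    t = threading.Thread(target=upload_with_retry,
                         args=(upload_fn, path), kwargs=kwargs,
                         daemon=True)
    t.start()
    return t
```

In the multi-node setup described above, each node would run its own such thread over the checkpoint shards it owns.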
Checklist
Reviewers