Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,7 @@ HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
| `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
| `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |
| `tensorrt_llm/_torch/models/` | PyTorch backend model implementations (distinct from `models/` used by TensorRT backend) |
| `tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md` | MoE architecture, backends, communication, development patterns — **read before modifying MoE code** |
| `CODING_GUIDELINES.md` | C++ and Python coding standards (referenced throughout, must read before contributing) |

## Design Patterns
Expand Down
165 changes: 165 additions & 0 deletions tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
# MoE Developer Guide

## Architecture

### MoE Layer in Model

```text
Input Hidden States
├──────────────────────┐
│ │
▼ ▼
fc_gate (Router) Shared Expert (optional)
│ │
▼ │
Fused-MoE │
┌─────────────────────┐ │
│ Routing (topK, etc) │ │
│ │ │ │
│ ▼ │ │
│ MoE Backends │ │
│ (FC1→Act→FC2) │ │
│ │ │ │
│ Apply Weights │ │
└─────────────────────┘ │
│ │
▼ ▼
Combine Outputs (sum) ◄───┘
Final Hidden States
```

### ConfigurableMoE: The Orchestrator

ConfigurableMoE composes independent components via composition (not inheritance):

```text
ConfigurableMoE
├── Backend (pure computation): routing → quantize → FC1 → activation → FC2
├── Communication (distributed): dispatch tokens → compute → combine results
├── EPLB (optional): dynamic expert migration across GPUs
└── Multi-chunk: splits tokens into chunks to reduce peak memory usage
```

Execution flow within ConfigurableMoE (`_forward_chunk_impl`):

```text
routing() → [EPLB] → quantize/dispatch (adaptive order) → backend.run_moe() → combine()
│ │
Communication Communication

Adaptive order (based on comm.supports_post_quant_dispatch()):
Post-quant flow: quantize_input() → comm.dispatch() (send quantized data)
Pre-quant flow: comm.dispatch() → quantize_input() (send raw, quantize locally)
```

### Core Design Principles

1. **Composition over inheritance** — Backend, Communication, and EPLB are independent, composable components
2. **Any Backend × Any Communication × EPLB On/Off** — All valid combinations should work
3. **Backend = pure computation** — No communication logic, no EPLB logic inside backends
4. **Communication is pluggable** — Auto-selected at runtime by `CommunicationFactory` based on hardware and workload
5. **Backend declares capabilities** — `can_implement()` declares what it supports; ConfigurableMoE adapts flow accordingly

## Architecture Transition (IMPORTANT)

The codebase is transitioning between two architectures:

| | Old Path | New Path |
|---|---|---|
| Entry | `XXFusedMoE` (e.g., `CutlassFusedMoE`) | `ConfigurableMoE` + `XXBackend` |
| Communication | Embedded inside each backend | Separated into `communication/` |
| EPLB | Only in WideEPMoE | Available to all backends |
| Status | Being replaced | Active development |

ConfigurableMoE currently supports these backends (`create_moe.py`):
- CutlassFusedMoE, TRTLLMGenFusedMoE, DeepGemmFusedMoE, CuteDslFusedMoE

Still on old path (standalone, with embedded communication):
- TritonFusedMoE, WideEPMoE, VanillaMoE

**Rule: All new features should target ConfigurableMoE + Backend architecture.**

## File Map

### Core (`fused_moe/`)

| File | Role |
|------|------|
| `configurable_moe.py` | Orchestrator — wires Backend + Communication + EPLB + multi-chunk |
| `create_moe.py` | Factory — selects MoE class based on `model_config.moe_backend` |
| `interface.py` | Base class `MoE` and enums (`MoEWeightLoadingMode`, `AlltoallMethodType`) |
| `quantization.py` | Quantization method implementations (`FusedMoEMethod` subclasses: weight creation, loading, quant/dequant ops per quant mode) |
| `routing.py` | Routing methods (`TopKRouting`, etc.) |
| `moe_load_balancer.py` | EPLB implementation |
| `moe_op_backend.py` | Op backend registry for TRTLLMGen (flashinfer/trtllm ops) |

### Backends (`fused_moe/`)

| File | Backend | Hardware | Scenario |
|------|---------|----------|----------|
| `fused_moe_cutlass.py` | CutlassFusedMoE | SM80+ | High throughput, most comprehensive quant support |
| `fused_moe_trtllm_gen.py` | TRTLLMGenFusedMoE | SM100/SM103 | Min-latency and high-throughput on Blackwell |
| `fused_moe_deepgemm.py` | DeepGemmFusedMoE | SM100/SM103 | FP8 Block Scales on Blackwell |
| `fused_moe_triton.py` | TritonFusedMoE | SM90 only | GPT-OSS on Hopper (requires `swiglu_gptoss_style=True`) |
| `fused_moe_cute_dsl.py` | CuteDslFusedMoE | SM100/SM103 | High throughput NVFP4, generally faster than Cutlass |
| `fused_moe_wide_ep.py` | WideEPMoE | All GPUs | Deprecating — use ConfigurableMoE instead |
| `fused_moe_vanilla.py` | VanillaMoE | All devices | Reference / debugging only |

### Communication (`fused_moe/communication/`)

Communication strategies are auto-selected at runtime by `CommunicationFactory` based on hardware and configuration. See `communication_factory.py` for selection logic and `base.py` for the `Communication` ABC.

### Tests (`tests/unittest/_torch/modules/moe/`)

| File | Tests | Status |
|------|-------|--------|
| `test_moe_backend.py` | Backend unit tests (run_moe, can_implement) | Active |
| `test_moe_module.py` | ConfigurableMoE integration tests (Backend × Comm × EPLB) | Active |
| `test_fused_moe.py` | Legacy moe tests | Being replaced, do NOT add new tests here |
| `test_moe.py` | Legacy TRTLLM backend tests | Being replaced, do NOT add new tests here |

## Backend Capability Matrix

### Quantization Support

Each backend's `can_implement(quant_algo, dtype_activation, swiglu_gptoss_style)` method declares supported quantizations. Source of truth: the `can_implement` classmethod in each backend file.

| Quantization | Cutlass | TRTLLMGen | DeepGemm | Triton | CuteDSL | WideEP | Vanilla |
|---|---|---|---|---|---|---|---|
| Unquantized (BF16/FP16) | Y (SM80+) | N | N | Y (SM90, BF16) | N | Y | Y |
| FP8 QDQ | Y (SM89+) | N | N | Y (SM90) | N | Y | Y |
| FP8 Block Scales | Y (SM90, SM120) | Y (SM100/103) | Y (SM100/103) | N | N | Y | Y |
| NVFP4 | Y (SM100/103/120/121) | Y (SM100/103) | N | N | Y (SM100/103) | Y | Y |
| W4A8 NVFP4 FP8 | N | Y (SM100/103) | N | N | N | N | N |
| W4A16 MXFP4 | Y (SM90) | Y (SM100/103) | N | Y (SM90) | N | N | N |
| W4A8 MXFP4 FP8 | Y (SM100/103) | Y (SM100/103) | N | Y (SM90) | N | N | N |
| W4A8 MXFP4 MXFP8 | Y (SM100/103) | Y (SM100/103) | N | N | N | N | N |
| W4A8 AWQ | Y (SM89/90) | N | N | N | N | N | N |
| W8A16 | Y (SM80+) | N | N | N | N | N | N |
| INT4 WoQ (W4AFP8) | N | N | N | N | N | Y | N |



## Canonical Examples

When adding new components, use these reference implementations:

| Task | Reference | Key methods to implement |
|------|-----------|------------------------|
| New Backend | `fused_moe_cutlass.py` (CutlassFusedMoE) | `can_implement`, `run_moe`, `create_weights`, `load_weights` |
| New Quantization Method | `quantization.py` → `FP8QDQFusedMoEMethod` | Subclass `FusedMoEMethod`, implement quant/dequant ops |
| New Communication Strategy | `communication/nvlink_one_sided.py` (NVLinkOneSided) | Subclass `Communication`, implement `prepare_dispatch`, `dispatch`, `combine` |
| Backend Tests | `test_moe_backend.py` | Follow existing parametrize patterns |
| Integration Tests | `test_moe_module.py` | Test Backend × Communication × EPLB combinations |

**Note on backend inheritance:** New backends should inherit from `MoE` (in `interface.py`), NOT from `CutlassFusedMoE`. Current backends inherit from `CutlassFusedMoE` as a historical shortcut to reuse infrastructure (load balancer, weight management, TP/EP). This will be refactored — a dedicated `MoEBackend` interface will be extracted.

## Anti-Patterns

- **Do NOT add communication logic inside backends** — Communication belongs in `communication/`, backends do pure computation
- **Do NOT modify old `XXFusedMoE` files for new features** — Use ConfigurableMoE + Backend architecture
- **Do NOT add new tests to `test_fused_moe.py` or `test_moe.py`** — Use `test_moe_backend.py` and `test_moe_module.py`
- **Do NOT skip `can_implement()` checks** — Every backend must declare what it supports; unsupported combos must return `(False, reason)`
Loading