Commit ddbfbe5

[Docs] Clarify Expert Parallel behavior for attention and MoE layers (vllm-project#30615)
Signed-off-by: majiayu000 <[email protected]>
1 parent 763963a commit ddbfbe5

2 files changed: +23 −3 lines changed

docs/serving/data_parallel_deployment.md

Lines changed: 2 additions & 2 deletions
@@ -8,11 +8,11 @@ For MoE models, particularly those like DeepSeek that employ MLA (Multi-head Latent Attention)
  In these cases, the data parallel ranks are not completely independent. Forward passes must be aligned, and expert layers across all ranks are required to synchronize during every forward pass, even when there are fewer requests to be processed than DP ranks.

- The expert layers will by default form a (DP x TP) sized tensor parallel group. To enable expert parallelism, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case).
+ By default, expert layers form a tensor parallel group of size `DP × TP`. To use expert parallelism instead, include the `--enable-expert-parallel` CLI arg (on all nodes in the multi-node case). See [Expert Parallel Deployment](expert_parallel_deployment.md) for details on how attention and expert layers behave differently with EP enabled.

  In vLLM, each DP rank is deployed as a separate "core engine" process that communicates with front-end process(es) via ZMQ sockets. Data Parallel attention can be combined with Tensor Parallel attention, in which case each DP engine owns a number of per-GPU worker processes equal to the configured TP size.

- For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form an EP or TP group of size (DP x TP).
+ For MoE models, when any requests are in progress in any rank, we must ensure that empty "dummy" forward passes are performed in all ranks that don't currently have any requests scheduled. This is handled via a separate DP Coordinator process that communicates with all ranks, and a collective operation performed every N steps to determine when all ranks become idle and can be paused. When TP is used in conjunction with DP, expert layers form a group of size `DP × TP` (using either tensor parallelism by default, or expert parallelism if `--enable-expert-parallel` is set).

  In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
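To make the default-vs-EP distinction in this hunk concrete, a single-node sketch along the following lines would exercise both modes. The model name and parallel sizes are illustrative placeholders rather than values from the patch; the flags (`--data-parallel-size`, `--tensor-parallel-size`, `--enable-expert-parallel`) are the standard `vllm serve` options referenced in the doc text.

```bash
# Illustrative sketch only (model name and sizes are placeholders).

# Default: expert layers join a (DP x TP) = 4-way tensor-parallel group.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --data-parallel-size 2 \
    --tensor-parallel-size 2

# With expert parallelism: same layout, but expert layers are instead
# sharded across an expert-parallel group of size DP x TP = 4.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --data-parallel-size 2 \
    --tensor-parallel-size 2 \
    --enable-expert-parallel
```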

docs/serving/expert_parallel_deployment.md

Lines changed: 21 additions & 1 deletion
@@ -44,7 +44,27 @@ Where:
  - `DP_SIZE`: Data parallel size
  - `EP_SIZE`: Expert parallel size (computed automatically)

- When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
+ ### Layer Behavior with EP Enabled
+
+ When EP is enabled, different layers in MoE models behave differently:
+
+ | Layer Type | Behavior | Parallelism Used |
+ |------------|----------|------------------|
+ | **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
+ | **Attention Layers** | Behavior depends on TP size | See below |
+
+ **Attention layer parallelism:**
+
+ - **When `TP = 1`**: Attention weights are **replicated** across all DP ranks (data parallelism)
+ - **When `TP > 1`**: Attention weights are **sharded** using tensor parallelism across TP ranks within each DP group
+
+ For example, with `TP=2, DP=4` (8 GPUs total):
+
+ - Expert layers form an EP group of size 8, with experts distributed across all GPUs
+ - Attention layers use TP=2 within each of the 4 DP groups
+
+ !!! note "Key Difference from Data Parallel Deployment"
+     Without `--enable-expert-parallel`, MoE layers would use tensor parallelism (forming a TP group of size `TP × DP`), similar to dense models. With EP enabled, expert layers switch to expert parallelism, which can provide better efficiency and locality for MoE models.

  ### Example Command
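As a concrete illustration of the `TP=2, DP=4` layout described in the new section, a launch along these lines would produce 4 DP engines with TP=2 attention each and a single EP group of size 8 for the expert layers. The model name is an illustrative placeholder, not taken from the patch, and this sketch is separate from the doc's own "Example Command" section.

```bash
# Illustrative sketch only (model name is a placeholder): 8 GPUs total.
# Attention: sharded with TP=2 inside each of the 4 DP engines.
# Experts: one expert-parallel group of size TP x DP = 8.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --tensor-parallel-size 2 \
    --data-parallel-size 4 \
    --enable-expert-parallel
```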
