Commit 5c31754 (parent e397f32)

docs: add parallelism strategy selection guide with MoE-specific layouts

Add docs/skills/knowledge split for parallelism strategy selection, based
on real recipe layouts from DeepSeek-V2/V3, Qwen3, Kimi-K2, etc. Key
insight: MoE models size TP by active params (often TP=1-2) and use EP as
the primary scaling dimension, unlike dense models. Verified TP=2+PP=2+SP
on cluster with llama32_1b_pretrain_config.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

3 files changed: +357 -0 lines changed

docs/parallelisms.md

Lines changed: 52 additions & 0 deletions
@@ -435,6 +435,53 @@ For example, with 32 GPUs total and the configuration above:
- `context_parallel_size = 2`
- `data_parallel_size = 32 / (2 × 4 × 2) = 2`

## Strategy Selection Guide

Choosing the right combination depends on model size, hardware topology,
and sequence length.

### Dense Models by Size

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
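Read as code, the table above is a simple threshold lookup. A minimal sketch, assuming nothing beyond the table itself; the function name and the single value picked from each recommended range are illustrative, not part of Megatron Bridge:

```python
# Hypothetical helper: map a dense model's size (billions of parameters)
# to a starting layout from the table above, picking one value per range.
def dense_starting_layout(params_b: float) -> dict:
    if params_b < 1:
        return {"tp": 1, "pp": 1, "cp": 1}   # DP only
    if params_b < 10:
        return {"tp": 4, "pp": 1, "cp": 1}   # TP=2-4 + DP
    if params_b < 70:
        return {"tp": 8, "pp": 4, "cp": 1}   # TP=4-8 + PP=2-4 + DP
    if params_b < 175:
        return {"tp": 8, "pp": 8, "cp": 1}   # TP=8 + PP=4-8 + DP
    return {"tp": 8, "pp": 16, "cp": 2}      # TP=8 + PP=8-16 + CP=2 + DP

print(dense_starting_layout(0.5))  # {'tp': 1, 'pp': 1, 'cp': 1}
```

DP then fills whatever factor of the world size remains after TP, PP, and CP are fixed.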
### MoE Models

MoE models differ fundamentally from dense models: only a fraction of
parameters are active per token, so TP can often stay at 1 or 2. EP is
the primary scaling dimension.

| Total / active params | Typical layout |
|---|---|
| < 20B | EP only (TP=1, PP=1) |
| 20-100B | TP=1-2 + PP=2-4 + EP=8-16 |
| 100-500B | TP=2-4 + PP=8-16 + EP=8-32 |
| 500B+ | TP=2 + PP=16 + EP=32-64 |

### By Hardware Topology

- **Single node with NVLink**: maximize TP within the node (up to TP=8).
- **Multiple nodes with InfiniBand**: keep TP within a node, use PP across nodes.
- **Limited network (Ethernet)**: minimize TP, prefer PP for cross-node scaling.

### By Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider hierarchical CP |

For operational details on configuring combined parallelism, troubleshooting
layouts, and memory estimation, see the
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies.md).

## Configuration Guidelines

### Memory Optimization
@@ -458,6 +505,11 @@ For example, with 32 GPUs total and the configuration above:
- **Token dropping** requires `alltoall` or `alltoall_seq` token dispatcher
- All parallelism strategies can be combined, but total parallelism must divide evenly into the world size

## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status

## Resources
- [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/)
knowledge/techniques/parallelism_strategies.yaml

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
title: parallelism_strategies
validated_on: "2026-03-15"
summary: >
  Megatron Bridge supports DP, TP, PP, SP, CP, and EP parallelism strategies
  which can be combined for models from sub-1B to 500B+ parameters. The right
  combination depends on model size, hardware topology, and sequence length.
validation_status:
  dp_ddp_distributed_optimizer:
    - code_verified
  tp_config_and_runtime:
    - code_verified
  pp_interleaved_schedule:
    - code_verified
  sp_activation_partitioning:
    - code_verified
  cp_context_parallel:
    - code_verified
  ep_expert_parallel:
    - code_verified
  combined_parallelism_init:
    - code_verified
  sizing_heuristics:
    - doc_only
feature_meaning:
  data_parallel: >
    Replicate model across GPUs, split data batches, synchronize gradients.
  tensor_parallel: >
    Split individual layer tensors across GPUs within a node.
  pipeline_parallel: >
    Assign consecutive layer groups to different GPUs, process microbatches
    in a pipeline.
  sequence_parallel: >
    Partition activations along the sequence dimension within TP groups to
    reduce activation memory.
  context_parallel: >
    Split long sequences across GPUs using ring attention or similar
    communication patterns.
  expert_parallel: >
    Distribute MoE experts across GPUs, only applies to expert layers.
recommended_path:
  dense_under_1b: DP only
  dense_1b_to_10b: TP=2-4 + DP
  dense_10b_to_70b: TP=4-8 + PP=2-4 + DP
  dense_70b_to_175b: TP=8 + PP=4-8 + DP
  dense_175b_plus: TP=8 + PP=8-16 + CP=2 + DP
  moe_under_20b: EP only (TP=1, PP=1)
  moe_20b_to_100b: TP=1-2 + PP=2-4 + EP=8-16
  moe_100b_to_500b: TP=2-4 + PP=8-16 + EP=8-32
  moe_500b_plus: TP=2 + PP=16 + EP=32-64
known_constraints:
  - TP should stay within a single NVLink domain for performance.
  - SP requires tensor_model_parallel_size > 1.
  - CP requires seq_length divisible by 2 * context_parallel_size.
  - EP requires num_moe_experts > 0 and expert_model_parallel_size divides num_moe_experts.
  - PP interleaved schedule requires virtual_pipeline_model_parallel_size > 1.
  - Total parallelism dimensions must divide evenly into world_size.
known_limitations:
  - Model-size-to-parallelism mapping is a heuristic, not a benchmark-proven table.
  - Not every parallelism combination has the same level of in-repo functional test coverage.
  - Memory estimates assume standard Adam optimizer and FP16/BF16 parameters.
evidence:
  - docs/parallelisms.md
  - docs/performance-guide.md
  - docs/training/communication-overlap.md
  - docs/training/hybrid-context-parallel.md
  - src/megatron/bridge/training/initialize.py
  - src/megatron/bridge/training/config.py
  - src/megatron/bridge/models/common/unimodal.py
follow_up_validation:
  - Add a checked-in combined parallelism functional smoke test for TP+PP+CP.
  - Add benchmark-backed sizing guidance for at least one model family.
  - Add an explicit EP+TP+PP functional smoke test for MoE models.
skills/perf-techniques/parallelism-strategies.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
---
name: parallelism-strategies
description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
---

# Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

- `docs/parallelisms.md`
- `knowledge/techniques/parallelism_strategies.yaml`

## Decision by Model Size

### Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |

### MoE models

MoE parallelism differs from dense-model parallelism. Because only a
fraction of the parameters is active per token, TP can often stay at 1
or 2 — the active parameter shard already fits on a single GPU. EP is
the primary scaling dimension, with PP handling cross-node layer
distribution.

| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

- TP is sized by **active** params, not total params. A 671B MoE with
  37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common choices: EP = num_experts or
  num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used; Llama 4 is an
  exception (ETP=4).

These are starting points, not hard rules. Always profile the first
iteration to verify memory and communication.

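The EP sizing pattern above can be sketched as a one-liner. The helper and its `experts_per_gpu` knob are illustrative assumptions, not Megatron Bridge parameters:

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP = num_experts / experts_per_gpu, per the key pattern above."""
    if num_experts % experts_per_gpu != 0:
        raise ValueError("experts_per_gpu must divide num_experts")
    return num_experts // experts_per_gpu

print(expert_parallel_size(64, experts_per_gpu=2))  # 32
```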
## Decision by Hardware Topology

Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N  # N = pipeline stages across nodes
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M  # M = cross-node pipeline depth
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP
for cross-node scaling. TP across nodes is almost always a performance
loss.

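The in-node rule can also be expressed as a tiny helper that splits a total model-parallel degree so TP never leaves the NVLink domain. A sketch under that single assumption; the function name is made up:

```python
def place_model_parallel(model_parallel: int, gpus_per_node: int) -> tuple:
    """Split a model-parallel degree into (tp, pp) with TP capped at one node."""
    tp = min(model_parallel, gpus_per_node)
    if model_parallel % tp != 0:
        raise ValueError("model_parallel must factor through the node size")
    return tp, model_parallel // tp

print(place_model_parallel(32, gpus_per_node=8))  # (8, 4): TP in-node, PP across nodes
```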
## Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |

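The thresholds above as a helper. This is illustrative only; the exact CP value inside the 4-8 range is a judgment call to settle by profiling:

```python
def context_parallel_for(seq_length: int) -> int:
    if seq_length < 8 * 1024:
        return 1   # < 8K: TP/PP/DP (plus SP from 2K up) is enough
    if seq_length < 32 * 1024:
        return 2   # 8K-32K
    return 8       # 32K+: CP=4-8; profile to choose within the range

print(context_parallel_for(128 * 1024))  # 8
```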
## Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 256 GPUs):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size = world_size / (TP * PP * CP)
```

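A sketch of that implicit calculation, including the divisibility check the framework enforces; the helper name is illustrative. The example numbers are the 32-GPU layout from `docs/parallelisms.md` (TP=2, PP=4, CP=2):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP={model_parallel} must divide world_size={world_size}"
        )
    return world_size // model_parallel

print(data_parallel_size(32, tp=2, pp=4, cp=2))  # 2
```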
## Memory Estimation

Without parallelism (70B model, FP16):

```
parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB
```

With TP=4, PP=4, DP=4 (64 GPUs):

```
parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:            ~38 GB per GPU
```

## Code Anchors
158+
159+
Parallelism dimensions set in model provider:
160+
161+
```66:81:docs/parallelisms.md
162+
model_config = GPTModelProvider(
163+
tensor_model_parallel_size=2,
164+
# ... other model parameters
165+
)
166+
```
167+
168+
DP size calculation:
169+
170+
```424:436:docs/parallelisms.md
171+
data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)
172+
```
173+
174+
Bridge initialization wires parallelism into process groups:
175+
176+
```618:628:src/megatron/bridge/training/initialize.py
177+
parallel_state.initialize_model_parallel(
178+
tensor_model_parallel_size=model_config.tensor_model_parallel_size,
179+
pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
180+
...
181+
context_parallel_size=model_config.context_parallel_size,
182+
hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
183+
expert_model_parallel_size=model_config.expert_model_parallel_size,
184+
...
185+
)
186+
```
187+
188+
## Pitfalls
189+
190+
1. TP across nodes destroys throughput. Always keep TP within a single
191+
NVLink domain.
192+
193+
2. PP without interleaving has large pipeline bubbles. Use
194+
`virtual_pipeline_model_parallel_size` when possible.
195+
196+
3. SP requires `tensor_model_parallel_size > 1`. Enabling SP alone
197+
without TP is a config error.
198+
199+
4. CP requires `seq_length % (2 * context_parallel_size) == 0`.
200+
201+
5. EP is only for MoE models. Setting `expert_model_parallel_size` on a
202+
dense model is a no-op or error.
203+
204+
6. The model-size-to-parallelism table above is a starting heuristic.
205+
Always profile the first iteration to check memory and communication.
206+
207+
7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
208+
overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.
209+
210+
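Pitfalls 3-5 and the divisibility rule are mechanical enough to pre-check before launching. A sketch; the function and its argument names are illustrative, not a real Megatron Bridge API:

```python
def check_layout(world_size, seq_length, tp=1, pp=1, cp=1, ep=1,
                 sequence_parallel=False, num_moe_experts=0):
    """Return a list of config errors per the pitfalls above (empty = OK)."""
    errors = []
    if sequence_parallel and tp <= 1:
        errors.append("SP requires tensor_model_parallel_size > 1")
    if cp > 1 and seq_length % (2 * cp) != 0:
        errors.append("seq_length must be divisible by 2 * context_parallel_size")
    if ep > 1 and num_moe_experts == 0:
        errors.append("EP requires a MoE model (num_moe_experts > 0)")
    elif ep > 1 and num_moe_experts % ep != 0:
        errors.append("expert_model_parallel_size must divide num_moe_experts")
    if world_size % (tp * pp * cp) != 0:
        errors.append("TP * PP * CP must divide evenly into world_size")
    return errors

print(check_layout(32, 4096, tp=2, pp=4, cp=2))  # []
```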
## Verification

Quick sanity check that combined parallelism initializes correctly, using
the smallest available recipe with overridden parallelism:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1
```

Success criteria:

- exit code 0
- finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
- log shows a TP=2 PP=2 DP=1 layout with 4 ranks
